Abstract



The first four chapters of this report primarily
provide an extensive, critical review of the literature
with regard to selected aspects of the criterion-
referenced and mastery testing fields. Major topics
treated include: (a) definitions, distinctions, and
background, (b) the relevance of classical test theory,
(c) validity and procedures for test construction, and
(d) test reliability.

Chapter V provides a treatment of criterion-refer- 
enced and mastery item analysis and revision procedures 
when items are scored in the classical correct/wrong 
manner. Chapter VI treats an alternative to the 
classical procedure for administering and scoring items.
This procedure employs the subjective probabilities
typically associated with confidence testing in order
to obtain pseudo-classical scores. These scores, which
have not been considered elsewhere, appear to be very
useful for item analysis purposes in that they have most
of the advantages and few of the disadvantages of both
classical scores and subjective probabilities.

Chapter VII provides an analysis of a set of data 
collected to illustrate many of the statistics and 
procedures discussed in Chapters V and VI, especially. 

One of the appendices to this report provides the
manual for an extensive test scoring and item analysis
program that uses student subjective probabilities as
input.

CHAPTER I
Introduction



Rationale for and Overview of Research Reported Here 

In the last decade there has been considerable
discussion, debate, research, and development surrounding
criterion-referenced testing and mastery testing. Inevi-
tably, the issues have been discussed from philosophical,
theoretical, and practical points of view. Some persons have
been primarily concerned about comparing these new testing
techniques with norm-referenced techniques; some have argued
that there are no important differences among these tech-
niques; others have argued that there are important differ-
ences; and still others have assumed that there are important
differences and proceeded from there.

Thus, the modus operandi among researchers who have
worked in the areas of criterion-referenced and mastery test-
ing has differed considerably, and this is probably desirable,
in general. However, this fact, the relative youth of these
testing techniques, their apparent popularity, and their
somewhat unrestrained use have all interacted to confuse
certain issues and to render very difficult an answer to the
question, "What do we know about criterion-referenced testing?"
Relatively little of what we know is found in textbooks or
even in the popular journals that treat measurement and
testing. As might be expected, some of the best work is
found in unpublished manuscripts and reports.

Thus, one purpose of this document is to provide an
overview of the literature on criterion-referenced and
mastery testing, especially with regard to statistical
measures, criteria, and procedures for criterion-referenced
reliability, validity, and item analysis. Any such review
of the literature is bound to be somewhat biased by the
author's subjective judgment, and it is virtually impossible
to reference all the work performed in any of these areas.
However, an effort has been made to identify the most impor-
tant, or potentially important, references.

Another purpose of this report is to discuss procedures
for identifying criterion-referenced and mastery test items
that require revision. It is the feeling of this author
that this is a fundamentally important topic, in that:
(a) consideration of the problems involved here helps to
clarify the important issues in criterion-referenced
reliability and validity, and (b) any measurement technique
can only be valid and useful to the extent that the
measurement instrument, and, hence, the test items are
at least minimally acceptable.

The proposal that formed the initial statement of the
research reported here also indicated that several different
kinds of item administration and scoring procedures would
be considered in terms of their applicability for criterion-
referenced and mastery testing. At the time the proposal
was written, this author felt that the typical correct/
wrong scoring procedure for objective items left much to be
desired, especially for many applications of criterion-
referenced testing. In particular, this author felt that
confidence testing, or one of its variants, might offer 
significant advantages to criterion-referenced testing. 
The research discussed in the later chapters of this report 
seems to support these beliefs. 



Definitions, Distinctions, and Background

Norm-referenced testing. Measurement theory traditionally
has been concerned with the accurate estimation and inter-
pretation of an individual's score in relation to the scores
of other individuals who have taken, or who might potentially
take, a given test. In fact, many psychometricians have
historically taken the position that "a test (that is) not
discriminating among examinees ... is not a useful measuring
instrument (Lord and Novick, 1968, p. 252)." However, it
should be noted that very few psychometricians define
measurement in a manner that necessitates this discriminating
function of a test. (See, for example, the definition of
measurement provided by Lord and Novick, 1968, p. 17.) In
other words, historically, most psychometricians have con-
cerned themselves with tests whose purpose is to maximally
discriminate among subjects with regard to some underlying
characteristic, trait, or construct; hence, the ability of a
test to provide a basis for making statistical statements
about the distinctions among students has, for many, become
an operational definition of "useful measurement instrument."
Furthermore, this point of view has necessitated that the
interpretation of a student's score be "... dependent on the
relative position of the score in comparison with other scores
(Popham and Husek, 1969, p. 3)." Tests of this kind are
currently referred to as norm-referenced tests. The term
"norm-referenced" is somewhat inappropriate in that the
"norm group" in traditional test theory usually has a specific
definition or connotation that is not necessarily consistent
with the term "norm-referenced"; however, here, as elsewhere
in this report, our concern is with describing terms in an
unambiguous manner, not changing their names.



Criterion-referenced testing -- background. Now, there
can be no argument about the usefulness of norm-referenced
testing; however, researchers for a number of years have noted
that certain purposes and uses of tests do not fit very well
into a norm-referenced framework. Flanagan (1951) and
Gardner (1962) pointed out some distinctions between what are
now called norm-referenced and criterion-referenced tests;
Ebel's (1962) work on "content standard scores" is also
frequently referenced as a precursor to criterion-referenced
testing. However, Glaser (1963) and Glaser and Klaus (1962)
are the earliest references that specifically consider
criterion-referenced tests, as such; the latter, in the
opinion of this author, is still one of the best introduc-
tions to the distinctions between norm-referenced and
criterion-referenced tests.

It is interesting to note the historical proximity
between criterion-referenced testing and programmed instruc-
tion, which provided a motivating factor in the development
of new instructional systems and educational technology.
This chronological proximity is probably not mere coinci-
dence, since, as Coulson and Cogswell (1965) note, changes
in testing procedures are a natural consequence of changes
in teaching method. In fact, most criterion-referenced testing
is closely associated with some kind of instruction, espec-
ially individualized or adaptive instruction. (See, for
example, Nitko, 1971, and Hambleton, 1973.)

There are several lessons to be learned from this
frequently occurring relationship between criterion-referenced
testing and instruction. First, it should be noted that this
relationship can easily confound the interpretability of
criterion-referenced measurements. In fact, one of the
difficulties with most of the literature is a needless
confusion of instruction and measurement. This is not to
say that instruction and measurement cannot and should not
interact. This author has even stated elsewhere that many
of the problems in instruction will not be resolved until
fundamental issues in measurement are adequately treated
(Brennan, 1973b). However, at least at the present time,
in the opinion of this author, it is necessary to recognize
the distinctions between measurement and instruction, if we
are to advance the cause of either. The relationship
between criterion-referenced testing and instruction may also
provide an explanation for the rather uneven interest, if not
the apathy, of many psychometricians with regard to criterion-
referenced testing. From a practical point of view, good
criterion-referenced test data for a reasonably large number
of subjects is quite rare, or not readily available to
psychometricians for analysis. For one thing, the collection
and analysis of criterion-referenced test data is often
somewhat overshadowed by the day-to-day exigencies of
providing instruction to students. Also, many psychometri-
cians work in environments that remove them from the testing
issues that often arise in instructional contexts; hence,
such psychometricians are often removed from the issues that
motivate much of the work in criterion-referenced testing.

Criterion-referenced testing -- definitions. Many
definitions of a criterion-referenced test have been pro-
posed in the literature. For example:

"A pure criterion-referenced test is one consisting
of a sample of production tasks drawn from a well-
defined population of performances, a sample that may
be used to estimate the proportion of performances in
that population at which the student can succeed
(Harris and Stewart, 1971, p. 1)."

"A criterion-referenced test is one composed of items
keyed to a set of behavioral objectives (Ivens, 1970,

"A criterion-referenced test is one that is deliberately
constructed so as to yield measurements that are directly
interpretable in terms of specified performance standards
(Glaser and Nitko, 1971)."

The last definition seems to be the one that is most uni- 
versally accepted, the second is one of the most general 
definitions, and the first is one of the most specific. 
This author prefers the last definition; however, many 
criterion-referenced tests appear to satisfy all definitions; 
and, therefore, arguments about the "best" definition may be
of more theoretical than practical concern. From another
point of view, however, it should be noted that some tests
which are criterion-referenced under one definition may not 
be criterion-referenced under another definition. 

Criterion-referenced testing -- characteristics. Among
the most frequently cited characteristics of a criterion-
referenced test are: (a) test items are associated with
specific behavioral objectives, (b) the resulting measure-
ment scale is an absolute, as opposed to a relative, scale, (c)
a student's score is capable of being interpreted indepen-
dent of the scores of other subjects, and (d) there is a
specified behavioral criterion (or criteria) for acceptable
performance. It is worth considering some of these character-
istics in more detail.

From a practical point of view, criterion-referenced test
items are almost always claimed to be associated with
specific objectives. However, this author does not believe
that anything in Glaser and Nitko's (1971) definition of a
criterion-referenced test necessitates that criterion-refer-
enced items must necessarily be associated with the typical
kinds of presently available, explicitly stated behavioral
objectives. This is not an argument against behavioral
objectives; rather it is an admonition not to needlessly
constrain the definition of a criterion-referenced test by
demanding that criterion-referenced test items reflect
particular kinds of behavioral objectives. Nevertheless,
it is critical that some behavioral criterion for acceptable
performance be specified.

Glaser (1963) discusses the issue of absolute versus
relative standards in the following terms:

"The scores obtained from an achievement test provide
primarily two kinds of information. One is the degree
to which the student has attained criterion performance,
for example, whether he can satisfactorily prepare an
experimental report, or solve certain kinds of work
problems in arithmetic. The second kind of information
that an achievement test score provides is the relative
ordering of individuals with respect to their test
performance, for example, whether student A can solve
his problems more quickly than student B. The principal
difference between these two kinds of information lies
in the standard used as a reference. What can be
called criterion-referenced measures depend upon an
absolute standard of quality, while what can be termed
norm-referenced measures depend upon a relative
standard (Glaser, 1963, p. 2)."

We stated above that norm-referenced tests are speci-
fically constructed to yield scores that allow for maximum
discrimination among subjects. More precisely, such measures
are intended to provide a basis for making distinctions
among subjects over a continuum of ability. Even though the
interpretation of a subject's criterion-referenced score is
independent of the scores obtained by other subjects, it is
not quite true to say that all criterion-referenced tests
are not intended to identify differences among subjects.
However, the differences to be identified are of a very
specific nature. That is, a criterion-referenced test is
often intended to distinguish between two groups of subjects:
those who have and those who have not achieved the specified
performance standard. For the most part, criterion-referenced
tests that have this intended purpose fall into the realm
of mastery testing.

Thus, norm-referenced and criterion-referenced tests
differ with regard to the desired nature of the discrimi-
nations among subjects. An analogy may help clarify this
point. In describing the length of a table, I may say that
it is either greater than or not greater than six feet long;
or I may say that it is longer than 80 percent of the tables
in the school cafeteria. The latter is analogous to the norm-
referenced kind of discrimination. Also, note that the
criterion-referenced statement makes use of an absolute
measurement scale, while the norm-referenced statement does
not.
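
The table analogy can be sketched in code. The following is
only an illustration and not part of the original analysis:
the function names and the cafeteria lengths are hypothetical,
and Python is assumed. The criterion-referenced statement
depends only on the fixed six-foot standard, while the
norm-referenced statement changes whenever the comparison
group changes.

```python
def criterion_statement(length_ft, standard_ft=6.0):
    """Absolute statement: is the table longer than a fixed standard?"""
    return length_ft > standard_ft


def norm_statement(length_ft, other_lengths_ft):
    """Relative statement: percent of the other tables that are shorter."""
    shorter = sum(1 for other in other_lengths_ft if other < length_ft)
    return 100.0 * shorter / len(other_lengths_ft)


# Hypothetical data: one table measured against two different cafeterias.
table = 7.0
cafeteria_a = [4.0, 5.0, 5.5, 6.0, 8.0]
cafeteria_b = [8.0, 9.0, 10.0, 11.0, 12.0]

absolute = criterion_statement(table)             # same for both cafeterias
relative_a = norm_statement(table, cafeteria_a)   # depends on cafeteria A
relative_b = norm_statement(table, cafeteria_b)   # depends on cafeteria B
```

With these made-up lengths the seven-foot table is longer than
80 percent of the tables in the first cafeteria and longer than
none in the second, although the criterion-referenced statement
is the same in both cases.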

Mastery testing. That part of Glaser and Nitko's (1971)
definition of a criterion-referenced test that refers to a
"specified performance standard" has been a subject of con-
siderable confusion and misunderstanding in the literature.
The standard should be specified and it should be amenable
to measurement of some kind (hence, the word "performance");
but the standard need not be a single score, the standard
need not be high, and certainly the standard need not be
perfect mastery, or anything close to perfect mastery. Now,
it is often true that the standard chosen is a single "high"
score, and, thus, in many cases, there is little operational
difference between a criterion-referenced test and a mastery
test; however, the difference between these two kinds of
tests is a potentially real and important one. This distinc-
tion should be recognized even if, in particular circum-
stances, the distinction is not made. For example, the tests
used in the National Assessment Program (see Merwin and
Womer, 1969) can be considered to be criterion-referenced,
but they are not a typical example of mastery tests. In
this report we will make the distinction between criterion-
referenced and mastery testing when we deem it to be
critical; otherwise, we will use the term "criterion-
referenced" instead of "mastery," since the latter is a
special case of the former.

The impetus for, and original work in, mastery testing
was presented by Bloom (1968) as part of a general model for
mastery learning. Perhaps the best-known references on the
topic are Bloom (1971) and Block (1971). The latter presents
a review of the literature which has recently been updated by
the same author (Block, 1973). From a measurement viewpoint
the outstanding issue in a discussion of mastery testing is
the cut-off value (cutting score, passing score, mastery
cut-off, or criterion) chosen as the basis for classifying
students as masters or non-masters. Emrick (1971),
Kriewall (1969), and Millman (1972) have all treated this
issue to some extent. However, there seems to be a subtle
difference between Glaser and Nitko's "specified performance
standard" and the basis upon which some persons recommend
choosing a mastery cutting score. At least sometimes, the
mastery cutting scores appear to be based partially upon
characteristics of the test score distribution. Such proce-
dures, in the opinion of this author, run the risk of
confounding the definition of mastery with the irrelevant
information provided by the test score distribution. For
example, taken to extremes, such a procedure might, after
the fact, classify all persons above the median as masters,
in which case, mastery learning is guaranteed to be effec-
tive (and not effective) for fifty percent of the students.
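
The median example can be made concrete with a short sketch.
It is an illustration only: the score lists, the standard of
7 items correct, and the function names are hypothetical, and
Python is assumed. A cut-off taken from the score distribution
classifies half of each group as masters regardless of how
well the group performed, whereas a cut-off fixed in advance
does not.

```python
from statistics import median


def masters_by_median(scores):
    """Distribution-based cut-off: masters are those above the median."""
    cut = median(scores)
    return [score > cut for score in scores]


def masters_by_standard(scores, standard):
    """Specified performance standard: masters meet a fixed cut-off."""
    return [score >= standard for score in scores]


def proportion(flags):
    """Proportion of True values in a list of classifications."""
    return sum(flags) / len(flags)


# Hypothetical number-correct scores on a ten-item test.
weak_group = [1, 2, 3, 4, 5, 6]
strong_group = [5, 6, 7, 8, 9, 10]

median_weak = proportion(masters_by_median(weak_group))
median_strong = proportion(masters_by_median(strong_group))
standard_weak = proportion(masters_by_standard(weak_group, 7))
standard_strong = proportion(masters_by_standard(strong_group, 7))
```

In this sketch the median rule labels exactly half of each
group as masters because the scores within a group are all
different; with tied scores at the median the fraction can
depart somewhat from one half.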

The word "criterion." Another issue with regard to
general terminology and background for this report concerns
the word "criterion." This word, for some time, has had
several denotations or connotations in test theory; and
with the advent of criterion-referenced and mastery testing
the potential ambiguities have increased. In classical
test theory, the word "criterion" usually refers to some
external measure that provides a standard against which
a particular test is compared; in this sense, the word
"criterion" is often associated with criterion, statistical,
or empirical validity (Brown, 1970). Also, in both classical
testing and mastery testing the word "criterion" is sometimes
synonymous with a cutting score, cut-off score, or "accepta-
ble" score magnitude. In criterion-referenced testing, the
word "criterion" refers generically to "the standard (or
criterion) against which a student's performance is com-
pared (Glaser, 1963, p. 519)." Nitko (1971) discusses these
distinctions in somewhat greater depth. Once these distinc-
tions are recognized, the context of a given discussion
usually resolves any ambiguities.

Criterion-referenced tests and scores. It is almost
inconceivable that a z-score, T-score, stanine, or per-
centile rank would be a criterion-referenced score, whereas
"number of items correct" or "proportion of items correct"
might be. Nevertheless, the actual student score reported
is, of itself, never sufficient to warrant saying that the
score is criterion-referenced. Such a statement can be made
only if the test is (or can be interpreted as) a criterion-
referenced test and the score reported reflects the specified
performance standard.

Merely viewing a test is not sufficient to identify it
as criterion-referenced or norm-referenced; one must also
know the manner in which it was constructed, the purpose
for which it will be used, and the way in which student
scores are constructed and interpreted. Furthermore,
practically any test has the potential for being either
criterion-referenced or norm-referenced. Thus, for example,
Ebel (1962) has suggested a procedure for deriving criterion-
referenced information from a norm-referenced test.

Ebel's limitations of criterion-referenced testing.
Surprisingly, Ebel is also a generally vocal critic of
criterion-referenced testing. In Ebel's view, the major
limitations of criterion-referenced tests are: "(1) they do
not tell us all we need to know about achievement, (2) they
are difficult to obtain on any sound basis, and (3) they are
necessary for only a small fraction of important educational
achievements (Ebel, 1970, p. 8)." The last objection is,
I think, very ambiguous in that Ebel does not define what
he means by "important educational achievements." As a
counterexample, in my experience, most teachers and those
working with instructional systems, when asked to character-
ize a "good" test for their purposes, invariably list
characteristics of criterion-referenced tests. Ebel's first
"limitation" is, at best, misdirected in that very few
researchers would want to argue that any testing technique
is likely to be sufficient in providing us with "all we need
to know about achievement (italics ours)." Ebel's second
limitation is at least partially true, but it is not true
that it is prohibitively difficult to obtain good criterion-
referenced measurements. Finally, in the opinion of this
author, Ebel's three supposed limitations of criterion-
referenced measurement are equally, if not more, appropriate
comments about norm-referenced measurement. But even if one
agrees that Ebel's statement of limitations is valid, this,
in itself, is not a justification for eliminating the use
or development of criterion-referenced testing, as some
might claim. The issue is not which kind of testing is
better, but rather, which kind of testing is appropriate,
under what circumstances, and for what purpose.

A Model for the Use of Achievement Data and Time Data in 
an Instructional System 

Since criterion-referenced and mastery testing are often
closely associated with an instructional system, it is
desirable to consider the role of these testing techniques
in an instructional system; at the same time it is useful
to consider the potential role of norm-referenced testing
in an instructional system. In this section, we briefly
consider these issues; the reader is referred to Brennan
(1973a) for a more complete discussion.

Here we restrict ourselves to a consideration of
achievement data and time data for evaluating the cognitive
aspects of an instructional system. Given the current state
of the art, one might argue that achievement data and time
data often provide the most useful and interpretable infor-
mation with regard to decision-making in an instructional
system; nevertheless, it should be noted that a complete
evaluation of an instructional system necessitates the col-
lection and use of other types of data, as well.

Objective-related modules. One reason that so much of
the literature on evaluating particular instructional systems
lacks generalizability to other instructional systems is that
the unit of analysis for the purpose of collecting data and
making decisions is apt to vary considerably from system to
system; and, often enough, the unit of analysis varies even
within the same system.

In some systems the unit of analysis is merely the amount
of instruction that occurs in some specified time period; in
other systems the unit of analysis corresponds with the in-
struction for some group of objectives which are taught
together in some sequence for pedagogical reasons. In both
of these cases, the unit of analysis corresponds with
obvious physical characteristics of the system, and, there-
fore, the unit of analysis typically involves a number of
different instructional objectives. However, the kinds of
decisions that must be made in evaluating and revising an
instructional system necessitate a consideration of all of
the data and instruction relating to each separate objective,
no matter when the data are collected or where the instruc-
tion occurs within the system.

In short, the basic unit of analysis in an instructional
system should be the objective. In order to emphasize this
fact and facilitate the collection and analysis of data for
decision-making, it is theoretically and practically useful
to view an instructional system as consisting of a discrete
number of objective-related modules. As employed here, the
phrase "objective-related module" refers to all of those
factors in an instructional system that are directly related
to a particular instructional objective. Note especially that
the term "module" is not used here as a descriptive character-
istic of the physical layout of an instructional system.
The central aspects of an objective-related module are the
objective itself and the instruction intended to teach the
objective. In addition, an objective-related module contains
all of the data directly relevant to the particular objective.

This conception of an instructional system in terms of
objective-related modules may appear too theoretical or too
trivial, at first glance; however, for purposes of evalua-
tion, the concept of an objective-related module has several
advantages over many other ways to outline and describe
an instructional system. First, and most importantly, this
concept directly implies that the objective is the basic unit
of analysis in an instructional system. Second, the objec-
tive-related module concept emphasizes the relationship
between the objective, instruction, and data. Third, any
instructional system can be described in terms of objective-
related modules, regardless of how the instruction is
sequenced or packaged. Fourth, the objective-related module
concept greatly facilitates an understanding of many of the
issues and problems surrounding the collection and use of
data in instructional systems.

Purposes of data collection. Any discussion of data
immediately raises two questions: for what purpose should
such data be collected and what kind of data should be
collected? Here we restrict the scope of these two ques-
tions to the domain of evaluating cognitive achievement in
an instructional system.

In general, of course, one can say that data is collec-
ted in the environment of an instructional system for the
purpose of evaluation, where, according to Stufflebeam (1971),
"evaluation is the process of delineating, obtaining, and
providing useful information for judging decision alter-
natives (p. 267)."

More specifically, one could say that data should be
collected for the purposes of diagnostic, formative, and
summative evaluation (Bloom, Hastings, and Madaus, 1971).
If one considers evaluation as a decision-making process,
then the diagnostic-formative-summative trichotomy refers
primarily to potential decision-making functions of evalua-
tion. However, we prefer to emphasize that the purpose of
collecting data in the environment of an instructional
system is to make decisions with regard to specific aspects
of the instructional system, namely: (a) instruction,
(b) students, and (c) test items. That is, we prefer to
emphasize the object of the decision-making process, as
opposed to its function. Emphasizing the object of the
decision-making process seems to identify more clearly the
specific nature of the decisions that typically need to be
made in an on-going instructional system.

Decisions about instruction are usually of primary
importance; i.e., one wants to assess the effects of instruc-
tion especially for the purpose of identifying instruction
that requires revision. Such decisions are often viewed as
part of the process of formative evaluation. In order to
make decisions concerning whether or not instruction should
be revised, we argue here that data should be obtained
which can be used to determine instructional effectiveness,
efficiency, and retention.

Decisions about students typically include decisions
concerning student placement and certification. Such
decisions are often viewed as part of the processes of
diagnostic and summative evaluation, respectively.

Decisions about test items also need to be made in
instructional systems. Specifically, one needs to deter-
mine the reliability and validity of tests used as part of
the instructional system.

Types of data. One can identify at least eight differ-
ent types of data for an objective-related module that pro-
vide meaningful sources of information for decision-making.
These types of data, listed in the order in which they would
usually be obtained, are as follows:

(a) Prerequisite test data, which indicates whether or
not a student has the background characteristics (attainment
of previous objectives, aptitude, etc.) thought to be neces-
sary in order to achieve the objective for the module;

(b) Pretest data, which measures a student's performance
on the objective prior to instruction;

(c) Instructional time, which is the length of time a
student spends undergoing instruction for the objective;

(d) Criterion-referenced posttest data, which measures
a student's performance on the objective immediately after
instruction;

(e) Norm-referenced posttest data, which is collected
immediately after instruction and measures student perform-
ance relative to the performance of other similar students;

(f) Retention time, which is the length of time inter-
vening between the posttest (usually criterion-referenced)
and a subsequent retention test (usually criterion-refer-
enced);

(g) Criterion-referenced retention test data, which is
collected some time after instruction and measures student
performance on the objective for the module; and

(h) Norm-referenced retention test data, which is
collected some time after instruction and measures student
performance relative to the performance of other similar
students.
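
The eight types of data listed above can be pictured as one
record per student for each objective-related module. The
sketch below is an illustration only: the class name, the
field names, and the use of proportions and minutes as units
are assumptions made here, not definitions from this report,
and Python is assumed. Data that were not collected are simply
left empty.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ModuleRecord:
    """One student's data for one objective-related module.

    All field names are hypothetical. None marks data that were
    not collected for this module.
    """
    prerequisite: Optional[float] = None         # (a) prerequisite test data
    pretest: Optional[float] = None              # (b) pretest data
    instructional_time: Optional[float] = None   # (c) instructional time
    cr_posttest: Optional[float] = None          # (d) criterion-referenced posttest
    nr_posttest: Optional[float] = None          # (e) norm-referenced posttest
    retention_time: Optional[float] = None       # (f) retention time
    cr_retention: Optional[float] = None         # (g) criterion-referenced retention test
    nr_retention: Optional[float] = None         # (h) norm-referenced retention test

    def available(self):
        """Return the names of the data types actually collected."""
        return [name for name, value in vars(self).items() if value is not None]


# Hypothetical student: pretest, 45 minutes of instruction, posttest.
record = ModuleRecord(pretest=0.2, instructional_time=45.0, cr_posttest=0.9)
collected = record.available()
```

Keeping the record at the level of the objective, rather than
the lesson or time period, follows the argument made earlier
that the objective should be the basic unit of analysis.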

It is often assumed that only criterion-referenced
or mastery test data provide meaningful information for
evaluation decisions with regard to instructional systems.
Certainly criterion-referenced data is more important than
norm-referenced data in the context of an instructional
system; however, norm-referenced data sometimes provides
useful additional information for decision-making (see
Brennan, 1973a, for more detail concerning this issue).

A table for relating data type and use. These data
for an objective-related module are displayed in Table 1-1
which, in addition, indicates those types of data that are
of primary importance for making decisions with regard to
instruction, students, and test items. In essence,
Table 1-1 provides a kind of taxonomy of achievement and
time data that are useful in evaluating instructional
systems. It is of course quite possible that a particular
objective-related module may not contain all of the data
indicated in Table 1-1. It is also possible that, in a
particular objective-related module, a test may, in fact,
consist of only one item. Clearly, when not all of the above
data are available, not all of the decisions indicated in
Table 1-1 can be made.

Observations from Table 1-1. Viewing our data in
the manner indicated in Table 1-1 illustrates and reinfor-
ces the following observations:

(a) Decisions regarding instructional effectiveness
necessitate a consideration of both pre- and posttest data.
Decisions regarding mastery and/or grading involve a
consideration of either pretest or posttest performance,
but not both, at least not in typical circumstances.

(b) Decisions regarding the efficiency of instruction 
necessitate a consideration of instructional effectiveness 
and the instructional time intervening between pre- and 
posttest* The importance of instructional tio\e in learning 
has been treated by Carroll (1963, 1973); in fact, this 
issue is one of the primary motivating factors in Bloom^s 
mastery learning model* 

(c) Norm-referenced tests can serve a useful function 
in grading students* This author suggests that a student's 
grade be based on both norm-referenced and criterion- 
referenced information, since grades seem to be used as 
both a measure of what a student knows and as a measure of 
how much a student knows compared to what other students know* 

(d) Decisions about validity and reliability are 
relevant- for all kinds of tests* Furthermore, decisions 
about validity and* reliability should be made for the 
"change" scores indicated by instructional effectiveness 
and retention* 

These and other points concerning the issues raised
by Table 1-1 are treated in much greater depth
by Brennan (1973a).

CHAPTER II



Classical Test Theory
in Criterion-Referenced Testing



Background

There seems to be some question in the minds of
some researchers concerning the applicability of the
classical test theory model to criterion-referenced
testing. This is a potentially serious concern in that,
if the assumptions of classical test theory are not met
by criterion-referenced tests, then psychometricians
evaluating such tests have virtually lost the benefit of
over fifty years of test theory development. Even if
criterion-referenced tests meet the assumptions of
classical test theory, this is not a guarantee that the
classical results and theorems form a sufficient
theoretical basis for criterion-referenced tests; however,
this problem is not nearly as serious as the problem of
assumptions. Some aspects of the applicability of
classical test theory have been discussed by Popham and
Husek (1969), but they have not discussed the validity
of the assumptions of classical test theory in relation
to criterion-referenced tests. Therefore, it seems
appropriate to analyze the assumptions of classical
test theory in order to determine if any of these
assumptions are not met by criterion-referenced tests.

In the author's opinion, the assumptions that we
will discuss are often misunderstood or not fully
appreciated. For example, many educators seem to have
a virtual psychological fixation on the normal curve;
such educators tacitly assume that the normal curve is
a sine qua non for classical test theory. As will be
shown, however, this is not the case: the assumptions
of classical test theory are distribution-free.

Notation 

Unfortunately, there is no standard notation used 
by all writers who work in the field of classical test 
theory. The notation used by Gulliksen (1950) is simple, 
but not always sufficient; the notation used by Lord 
and Novick (1968) is very precise but perhaps more 
complicated than necessary for most researchers and
practitioners. Therefore, an effort will be made to
combine the most favorable aspects of both schemes of
notation in the hope that the reader will be able to
apply the adapted notation scheme to both of the
above basic references.

Let $X_{gi}$ be the observed score for the i-th person
on test g, where K is the total number of items
on test g and N is the total number of persons; i.e.,

    $X_{gi} = \sum_{j=1}^{K} X_{gij}$,

where $X_{gij}$ is the score on item j of test g for person i.

Let $T_{gi}$ be the true score for the i-th person on
test g. The assumptions that will be stated later
serve to define what we mean by "true score." One of
the theorems that can be proved is that, in the classical
test theory sense, the true score is the expected value
of the observed score.

Let $E_{gi}$ be the error score for the i-th person on
test g. $E_{gi}$, called the "error of measurement," is
the result of various chance or random factors that
cause a person to answer correctly items he does not
know or to answer incorrectly items he does know. Note
that the errors accounted for by $E_{gi}$ are chance errors,
not systematic errors.

It is worth noting that $T_{gi}$ is a fixed quantity for
person i, but $X_{gi}$ and $E_{gi}$ are random variables. If
the same person were given the same test a number
of times, and if after each testing the person's "brains
were washed," we would expect that the person's observed
scores and error scores would show some variation;
however, the person's true score is constant by definition.
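
The distinction between a fixed true score and random observed and error scores can be illustrated with a small simulation. The sketch below is only an illustration under stated assumptions: the function name, the true score of 20, and the set of equally likely integer errors are hypothetical and are not taken from this report. Any error distribution with mean zero would serve equally well.

```python
import random
from statistics import mean

# Hypothetical mean-zero chance errors (an assumption for illustration);
# the classical model does not prescribe any particular error distribution.
ERROR_VALUES = [-2, -1, 0, 1, 2]

def simulate_replications(true_score, n_replications, seed=0):
    """Return observed scores X = T + E for one person over
    independent (hypothetical) replications of the same test."""
    rng = random.Random(seed)
    return [true_score + rng.choice(ERROR_VALUES)
            for _ in range(n_replications)]

true_score = 20  # fixed for the person, by definition
observed = simulate_replications(true_score, n_replications=10_000)
errors = [x - true_score for x in observed]

print("mean observed score:", round(mean(observed), 2))
print("mean error score:   ", round(mean(errors), 2))
```

The observed and error scores vary from replication to replication, while the mean observed score settles near the true score as the number of replications grows.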

Assumptions 

The following assumptions express the posited
relations between $X_{gi}$, $T_{gi}$, and $E_{gi}$.

A1: Definition of Random Error

    $E_{gi} = X_{gi} - T_{gi}$, or $X_{gi} = T_{gi} + E_{gi}$;

A2: Zero Average Error

    $\mathcal{E}(E_{gi}) = 0$ in every non-null subpopulation of
    persons;

A3: Zero Correlation between True Score and
    Error Score

    $\rho(T_{gi}, E_{gi}) = 0$;

A4: Definition of Parallel Tests: parallel
    tests f, g, and h are defined as tests for
    which

    (i) $T_{fi} = T_{gi} = T_{hi}$,

    (ii) $\sigma^2(E_f) = \sigma^2(E_g) = \sigma^2(E_h)$, and

    (iii) $\rho(T_{fi}, T_{gi}) = \rho(T_{fi}, T_{hi}) = \rho(T_{gi}, T_{hi})$;

A5: Zero Correlation between Errors on Parallel
    Tests

    $\rho(E_{gi}, E_{hi}) = 0$ for parallel tests g and h;

A6: Zero Correlation between Errors on One Test and
    True Scores on a Parallel Test

    $\rho(E_{gi}, T_{hi}) = 0$ for parallel tests g and h.

In the above assumptions A1 - A6, $\mathcal{E}$ indicates the expected
value over persons in the subpopulation of persons under
consideration, $\rho$ indicates the correlation in the
population of persons, and $\sigma$ indicates the standard deviation
in the population. The subscripts f, g, and h are
reserved for parallel tests.

The set of assumptions A1 - A6 is actually more than
sufficient. For example, Gulliksen (1950, pp. 6-13) does
not list A4 (iii) and A6 as assumptions, since it is
possible to prove both of these relations from the other
assumptions.

Assumptions A1 - A6 are primarily based upon a
consideration of the errors of measurement $E_{gi}$. It is
also possible to state the above assumptions in
terms of the true scores, but the assumptions then become
somewhat more difficult to understand, and the resulting
theorems necessitate more complicated derivations.
Furthermore, we wish to concentrate upon errors of
measurement since they form the crux of several
arguments presented later in this chapter.

Let us now analyze the meaning of these assumptions. 

Assumption A1: definition of random error.
Assumption A1 postulates a linear relationship between
the observed, true, and error scores for a (randomly
chosen) person i. We are, in effect, saying that error
score is the simple difference between observed and true
score. Since, however, only $X_{gi}$ is directly observable,
the linear relationship contains two unknown
quantities and is, therefore, undefined without additional
information.

Assumption A2: zero average error. Assumption
A2 states that given any non-null subpopulation of
persons (where the population is countably infinite)
the expected value of the error scores over persons is
zero. In practice, the larger the number of cases in the
distribution, the more closely this assumption will be
approximated. This assumption implies two important results:

(a) In the entire population of persons, the
expected value of error scores is zero, and

(b) In every subpopulation consisting of persons
with the same true score, the expected value of the
error scores is zero.

The latter result may be written mathematically as:

    $\mathcal{E}(E_{gi} \mid T_g) = 0$;

i.e., the expected value of $E_{gi}$ for given true score $T_g$
is zero, or the regression of error scores on true
score is a horizontal straight line passing through
the origin.

Assumptions A1 and A2 serve to define what is
meant by true score. Also note that these assumptions
imply that

    [equation not legible in the source]

Assumption A3: zero correlation between true scores
and error scores. Assumption A3 states that the
correlation between true and error scores in the population
is zero. This means that we assume that there is no
reason to expect positive (negative) errors to occur
more frequently with high (low) true scores than with
low (high) true scores. Note that Assumption A3 does
not mean that error scores are distributed independently
of true scores. If errors are uncorrelated, "this
merely means that the product-moment correlation is
zero; if they are independent, this means that the
frequency distribution of errors of measurement is the
same regardless of the examinee's true score (Lord,
1959a, p. 331)."

Assumption A4: definition of parallel tests.
Assumption A4 serves to define what we mean by parallel
tests. Parallel tests are tests that have (i) the
same true scores, (ii) the same population variances,
and (iii) identical intercorrelations. Of course, if
there are only two parallel tests, then (iii) becomes
meaningless. The concept of parallel tests may initially
appear to be of secondary importance; however, parallel
tests play an important role in classical test theory.

Assumption A5: zero correlation between errors
on parallel tests. Assumption A5 states that the
population correlation between random errors of measurement
on parallel tests is zero.

Assumption A6: zero correlation between errors
on one test and true scores on a parallel test.
Assumption A6 states that the population correlation between
random error scores on one test and true scores on a
second parallel test is zero.

Classical Assumptions and Criterion-Referenced Testing

We stated in passing that the assumptions of
classical test theory are distribution-free, and then we went
on to discuss each of these assumptions. Note that none
of the above assumptions necessitates any knowledge about
the distribution of observed, true, or error scores. This
fact, in addition to the very general nature of the
assumptions themselves, seems to argue quite strongly that
the classical test theory model is appropriate for
criterion-referenced tests. At least, this author knows
of no definition of criterion-referenced testing that,
de facto, involves a violation of the classical test
theory assumptions.



There is, however, at least one potential problem
with the classical test theory assumptions in certain
criterion-referenced and mastery testing situations.
Consider the subset of persons whose true score equals the
highest possible true score for the test under
consideration. From Assumption A2 we know that the expected
value of the error scores for persons with the highest
possible true score must be zero. In order for this to
be true, positive and negative errors must be offsetting,
but the highest possible true score equals the highest
possible observed score, so positive errors cannot occur.
Therefore, all errors about
the highest possible true score must always be zero.
This conclusion seems difficult to support. It is
somewhat analogous to saying that brilliant people are
never subject to chance or random errors in their field
of expertise.

Incidentally, this reservation about the classical
test theory model is theoretically valid in
norm-referenced testing situations as well as in
criterion-referenced testing situations. However, in most
norm-referenced testing situations the probability that a
person will have the highest possible true score is
small, whereas this is not always true in
criterion-referenced and mastery testing situations. Thus, the
above reservation is potentially more serious for
criterion-referenced tests than for norm-referenced tests.
In either case, however, this author is not convinced
that the reservation noted above is a devastating
criticism of the classical model. Perhaps models can be
posited that obviate this problem, but, in the meantime,
the classical test theory model seems to provide a
reasonable initial model for considering
criterion-referenced tests.

It is useful to keep in mind three facts about the
classical model: (a) X = T + E, (b) errors are random
errors, and (c) a person's true score is the expected
value of a large number of observed scores for that
person. We wish to note one other aspect of the
classical test theory assumptions. None of the
assumptions requires that items be scored in the usual
correct/incorrect manner. We will refer to this scoring
procedure as the "classical scoring procedure"; however,
the classical scoring procedure is not a necessary
condition for classical test theory.

Weak and Strong True Score Models



No matter how one applies the classical test theory
model, it is clear that its assumptions do not
constitute a very strong statistical framework for evaluating
any test. In fact, the classical assumptions together
with theorems that can be proven from these assumptions
constitute a "weak" true score model, weak in the sense
that all results are distribution-free (Lord, 1965).
However, it is a truism in mathematics that the weaker
the assumptions, the weaker the results (Novick, 1966).
In the last ten years, therefore, psychometricians
such as Lord, Keats, and Novick have attempted to develop
some strong true score models for tests. (Lord and
Novick, 1968, is perhaps the best reference for the
currently available strong true score models.) These
models make stronger assumptions than A1 - A6 in the
previous section, and the results that can be derived
are likewise stronger.

Actually, one true score model (although it is
seldom called a "true score model") has been in vogue
for a considerable length of time. Many researchers
(e.g., Gulliksen, 1950) have noted that in order to
make use of the errors of measurement, it is necessary
to make certain assumptions about the distribution of
these errors (Gulliksen, 1950, p. 17). In the typical
test theory situation it is usually assumed that these
errors of measurement are normally distributed,
independently of the true score, with mean zero in the
population and constant population variance. It is
primarily these assumptions that have led some naive
users of classical test theory to mistakenly assume that
classical test theory relates only to normally
distributed test scores.

Characteristics of Errors of Measurement 

It is instructive to consider the implications of
only the classical assumptions upon errors of
measurement. Assumptions A2 and A3 imply that

    $\mathcal{E}(E_{gi} \mid T_g) = 0$,

i.e., the regression of errors on true score is linear.
More specifically, the best fitting line (in a least
squares sense) is a horizontal straight line passing
through the origin. Note that this does not mean that
error scores are distributed independently of true scores;
i.e., the distribution of error scores around any given
true score is not necessarily the same. This means that
the errors of measurement are unbiased; it does not mean
that the variances of the errors of measurement around
the true scores are equal.

Table 2-1 represents the observed scores and error
scores for three true scores, where we assume that there
are only three true scores and there are only as many
people in the population as there are observed (or
error) scores. Figure 2-1 represents the regression of
these error scores on the true scores. The regression
line is the line identical with the T-axis. Note that it
is clearly true that $\mathcal{E}(E_{gi}) = 0$ since
$\mathcal{E}(E_{gi} \mid T_g) = 0$ for every true score $T_g$.
Since the regression line is horizontal, its slope is zero and
consequently $\rho(T_{gi}, E_{gi})$ is also zero. Finally, note that
the variances of the errors of measurement
around the different true scores are not equal.

Now, it can be shown that

    $\mathcal{E}(X_{gi} \mid T_g) = T_g$,

implying that the regression of observed scores on
true scores is also linear (Lord and Novick, 1968,
p. 65). Moreover, this regression line passes through
the origin and its slope is equal to unity. Figure 2-2
shows a graph of this regression line for the data given
in Table 2-1.
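
These properties can be checked numerically. The following sketch uses a small hypothetical population; it is not the data of Table 2-1, and the dictionary and variable names are assumptions introduced only for illustration. Within each true score the errors average zero, yet their variances differ, and the true-error covariance (hence the correlation) is zero.

```python
from statistics import mean, pvariance

# Hypothetical population (NOT the report's Table 2-1):
# each true score maps to the observed scores of the persons who hold it.
population = {
    10: [8, 10, 12],
    20: [12, 16, 20, 24, 28],
    30: [28, 30, 30, 32],
}

# Conditional mean and variance of the errors E = X - T for each true score.
cond_mean_error = {t: mean([x - t for x in xs]) for t, xs in population.items()}
cond_var_error = {t: pvariance([x - t for x in xs]) for t, xs in population.items()}

# Conditional mean of observed scores: the regression of X on T.
cond_mean_observed = {t: mean(xs) for t, xs in population.items()}

# Population covariance between true scores and errors.
pairs = [(t, x - t) for t, xs in population.items() for x in xs]
mean_t = mean([t for t, _ in pairs])
mean_e = mean([e for _, e in pairs])
cov_te = mean([(t - mean_t) * (e - mean_e) for t, e in pairs])

print("E(E | T):  ", cond_mean_error)
print("Var(E | T):", cond_var_error)
print("E(X | T):  ", cond_mean_observed)
print("cov(T, E): ", cov_te)
```

The conditional error variances differ across true scores, so the errors are unbiased and uncorrelated with true score without being distributed independently of it.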

Neither one of the above regressions (represented
by Figures 2-1 and 2-2, respectively) is, however, the
primary regression of interest. The test evaluator is
usually primarily concerned about $\mathcal{E}(T_g \mid X_g)$, the
regression of true scores on observed scores, in
order to estimate a student's true score from his
observed score. However, this regression can be
non-linear, and consequently neither the true score nor the
observed score distribution is necessarily normal (Lord
and Novick, 1968, pp. 500-505).

Table 2-1
The Relation Between True, Observed, and
Error Scores (Synthetic Data)
[Table entries not reproduced. Columns: True, Observed, Error.]

Figure 2-1
The Regression of Error Scores on True Scores
[Figure not reproduced.]

Note. The data for the above figure are given in
Table 2-1. The abscissa represents true score T and
the ordinate represents error score E. Note that the
variances of the errors of measurement about each of
the true scores are not equal.

Figure 2-2
The Regression of Observed Scores on True Scores
[Figure not reproduced.]

Note. The data for the above figure are given in Table
2-1. The abscissa represents true score T and the ordinate
represents observed score X. Note that the variances of the errors of
measurement about each true score are not equal.

Figure 2-3
The Regression of True Scores on Observed Scores
[Figure not reproduced.]

Note. The data for the above figure are given in Table 2-1. The
ordinate represents true score T and the abscissa represents observed
score X. The circled points represent three sets (pairs) of observed
scores that map into different true scores. For example, an observed
score of 12 can indicate a true score of 10 or 20.

Figure 2-3 shows a plot of the distribution of true
scores (ordinate) versus the distribution of observed
scores (abscissa) for the data in Table 2-1. The six
circled points in Figure 2-3 represent three sets of
(two) observed scores that map into different true
scores. For example, according to Figure 2-3 an observed
score of 12 can indicate a true score of either 10 or 20.
A similar situation occurs for observed scores of 14 and
28. The fact that certain observed scores do not map
into unique true scores indicates that
$\mathcal{E}(T_g \mid X_g)$ is not linear. If this is not
clear, consider the following:

    $\mathcal{E}(T_g \mid 6 \le X_g \le 11) = 10$,

    $\mathcal{E}(T_g \mid X_g = 12) = 15$,

    $\mathcal{E}(T_g \mid X_g = 13) = 10$,

    $\mathcal{E}(T_g \mid X_g = 14) = 15$,

    $\mathcal{E}(T_g \mid 16 \le X_g \le 26) = 20$,

    $\mathcal{E}(T_g \mid X_g = 28) = 25$, and

    $\mathcal{E}(T_g \mid 29 \le X_g \le 32) = 30$.

The above expectations certainly do not constitute a
linear function. A linear best-fitting regression line
could be forced to fit the data, but this would not be
"the" best-fitting curve for the data; "the" best-fitting
curve would be curvilinear.
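
The non-linearity of the regression of true scores on observed scores can be seen in a small computation. The sketch below again uses a hypothetical population (not the data of Table 2-1; all names and values are assumptions for illustration) in which two observed scores, 12 and 28, each arise from two different true scores.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical population (NOT the report's Table 2-1):
# each true score maps to the observed scores of the persons who hold it.
population = {
    10: [8, 10, 12],
    20: [12, 16, 20, 24, 28],
    30: [28, 30, 30, 32],
}

# Group true scores by observed score.
true_by_observed = defaultdict(list)
for t, xs in population.items():
    for x in xs:
        true_by_observed[x].append(t)

# Conditional expectation E(T | X = x) for every observed score x.
cond_mean_true = {x: mean(ts) for x, ts in sorted(true_by_observed.items())}

# Slopes between successive points; a linear regression would give one slope.
points = list(cond_mean_true.items())
slopes = [(t2 - t1) / (x2 - x1)
          for (x1, t1), (x2, t2) in zip(points, points[1:])]

for x, t in points:
    print(f"E(T | X = {x:2d}) = {t}")
print("distinct slopes:", sorted(set(slopes)))
```

Because the slopes between successive conditional means are not all equal, no single straight line passes through them.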

Normal Error Model



The results shown in Figures 2-1, 2-2, and 2-3 are
based only upon the assumptions of classical test theory.
Consequently, these results are distribution-free. Note
especially that these results do not make any assumptions
about the distribution of the errors of measurement.
Gulliksen (1950, p. 17) notes that in order to make use
of the errors of measurement we must make some assumptions
about the frequency distribution of these errors. The
assumptions that we will now discuss form the rationale
behind the theory for norm-referenced tests that rely
upon normally distributed true and observed score
distributions.

When, in addition to the assumptions of classical
test theory, we assume that (a) the errors of measurement
are distributed independently of true score, (b) the
errors of measurement are distributed normally with
mean zero and constant variance, and (c) the regression
of true scores on observed scores is linear, then both
the true and observed scores must be normally
distributed (Lord and Novick, 1968, p. 503) with

    $\sigma_T^2 = \sigma_X^2 - \sigma_E^2$.

In practice, the last assumption (the linearity of
the regression of true on observed scores) is often
neglected (Gulliksen, 1950, Section 2.11), without
serious difficulty. It can be shown that the linear
regression of true scores on observed scores is given
by

    $\mathcal{E}(T_g \mid X_g) = (1 - \rho_{XX'}) \mu_X + \rho_{XX'} X_g$,

where

    $\rho_{XX'} = \rho_{TX}^2$.

In both cases $\rho_{XX'}$ is called the reliability coefficient
(which is also the correlation between parallel
measurements).
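
As a worked illustration, the linear regression of true on observed score is commonly written as (1 - reliability) times the observed-score mean plus reliability times the observed score. The sketch below simply evaluates that expression; the function name and the numerical values (a mean of 20, a reliability of .80, an observed score of 28) are hypothetical and are not taken from this report.

```python
def linear_true_score_estimate(observed, mean_observed, reliability):
    """Estimate a true score by linear regression on the observed score:
    (1 - reliability) * mean_observed + reliability * observed."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie between 0 and 1")
    return (1.0 - reliability) * mean_observed + reliability * observed

# Hypothetical values, for illustration only.
estimate = linear_true_score_estimate(observed=28.0,
                                      mean_observed=20.0,
                                      reliability=0.80)
print(round(estimate, 2))
```

The estimate is pulled from the observed score toward the group mean, and the pull is stronger the lower the reliability; with a reliability of unity the estimate equals the observed score.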

Assumptions (a) and (b) are indicated in Figure 2-4.
Note the difference between the distributions of the
errors of measurement indicated in Figures 2-1 and 2-4.
In both figures, $\mathcal{E}(E_{gi} \mid T_g) = 0$, but in Figure 2-4
the errors of measurement have a specific
distributional form (i.e., the same normal distribution)
for each true score $T_g$.

Assumptions (a), (b), and (c) above thus provide
the basis for a strong true score theory of test scores
that results in normally distributed true and observed
scores. In practice, these assumptions are the ones
most frequently made (either consciously or unconsciously)
about errors of measurement. These are the assumptions
that make it possible to evaluate and interpret most of
the currently available norm-referenced tests.

The Relevance of Normality Assumptions to
Criterion-Referenced Tests



There is no doubt that the normality assumptions
presented in the previous section are very useful for
many testing purposes; however, these assumptions do not
seem to be applicable for many criterion-referenced
testing situations. Lord states:

    The assumption that each error is distributed
    $N(0, \sigma^2)$ independently of true score is probably
    quite adequate for many purposes. However, it is
    clear that these assumptions cannot be met when the
    true score, expressed as a proportion of the
    number of items in a test, is near zero or near
    one. If n is the number of test items, and r/n
    is some small number like .01, it is intuitively
    obvious, in view of the fact that the observed
    test score can never be negative, that the
    distribution of the errors of measurement will in
    all probability be skew, and that the standard
    deviation of this distribution will surely be less
    than if the true score were not so near to zero
    (Lord, 1959b, p. 475).

Figure 2-4
The Assumption of $N(0, \sigma_E^2)$ Distributed
Errors of Measurement
[Figure not reproduced. Abscissa: true scores; ordinate: error score E.]

Note. In the above figure we assume that there are only
three possible true scores on a (hypothetical) test. The errors
of measurement about each true score are normally distributed with
mean zero and constant variance $\sigma_E^2$.

Similarly, if the true score expressed as a proportion
of the number of items correct is near unity, then the
standard deviation of the distribution of the errors of
measurement will be less
than if the true scores were not so close to unity. If,
in either case, it were assumed that the errors of
measurement were distributed independently of true score
with constant variance about each true score, this would
imply that certain (postulated) observed scores would,
in fact, be unobtainable. (See Figure 2-5.) Lord (1960)
discusses in some depth the consequences of assuming
that errors of measurement are distributed

    $N(0, \sigma^2)$

independently of true score.
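
The size of this difficulty can be gauged with a short calculation. The sketch below is a rough illustration under stated assumptions: scores are treated as continuous, the maximum score of 30 follows Figure 2-5, and the error standard deviation of 2 and the function names are hypothetical. It computes the probability that a normal, constant-variance error model assigns to observed scores outside the obtainable range.

```python
import math

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def prob_unobtainable(true_score, error_sd, max_score, min_score=0.0):
    """Probability that X = T + E falls outside [min_score, max_score]
    when E is assumed N(0, error_sd**2) regardless of true score."""
    above = 1.0 - normal_cdf((max_score - true_score) / error_sd)
    below = normal_cdf((min_score - true_score) / error_sd)
    return above + below

MAX_SCORE = 30.0   # maximum score, as in Figure 2-5
ERROR_SD = 2.0     # hypothetical constant error standard deviation

for t in (20.0, 28.0, 30.0):
    p = prob_unobtainable(t, ERROR_SD, MAX_SCORE)
    print(f"T = {t:4.0f}: P(X outside [0, {MAX_SCORE:.0f}]) = {p:.4f}")
```

For a true score in the middle of the range the misplaced probability is negligible, but for a true score equal to the maximum, half of the postulated distribution lies above the highest obtainable score.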

Since, for many criterion-referenced tests, many
of the students get most of the items correct, it is
obvious that we often expect the true proportion of
items correct for at least some students to be near unity.
Thus, on the basis of the arguments presented above, it
should be clear that the normality, constant variance,
and independence assumptions presented in the previous
section are not always applicable for criterion-referenced
tests. The next section describes assumptions that are,
however, quite appropriate for such tests.

The Binomial Error Model 

Recall that the normal error model assumes that
(a) the errors of measurement are distributed independently
of true score and (b) the errors of measurement are
distributed normally with mean zero and constant variance
for each true score, i.e.,

    $N(0, \sigma^2)$.

In the previous section we demonstrated that neither
(a) nor (b) is reasonable for at least some
criterion-referenced tests. Thus, for such tests we are forced to
make other assumptions about the errors of measurement. A
little over a decade ago Keats and Lord (1962) postulated
that a reasonable distributional form for the errors of
measurement (assuming that observed scores are bounded)
is the binomial distribution with its parameter equal
to a specified true score. Keats and Lord (1962) give
no indication that they were, at that time, even
considering what we now call criterion-referenced tests;
however, as will be demonstrated, the implications of
this assumption correspond quite well to a working
definition of the distributional form of many
criterion-referenced test scores.

Figure 2-5
A Consequence of Assuming $N(0, \sigma_E^2)$ Distributed
Errors of Measurement
[Figure not reproduced. Abscissa: true scores (10, 20, 30); ordinate: observed score X.]

Note. The shaded area indicates scores that are not obtainable
assuming that the maximum score on the test is 30.

More extensive discussions of the binomial error
model can be found in Keats and Lord (1962), Keats (1964),
Lord (1965), Lord and Novick (1968), and Brennan (1970).
The last reference is intended to provide a simplified
and concise description of the binomial error model,
especially for those interested in its possible
application in criterion-referenced testing. In this report
we will merely provide a brief outline of the binomial
error model.

As far as notation is concerned, subscripts for
variables will be dropped unless they are required to
avoid ambiguity. As before, T represents true score,
X represents observed score, and E represents error of
measurement. However, rather than T, the true score
number of items correct, we will be concerned primarily
with $\zeta$, the true score proportion correct; i.e.,

    $\zeta = T/N$,

where N is the number of items on the test. Note that
the observed score, X, is a discrete variable, while the
true score, $\zeta$ (as well as T), is assumed to be continuous.
Distributions will be identified as follows:

    $\phi(X)$ = the distribution of observed scores,

    $g(\zeta)$ = the distribution of true scores,

    $f(E \mid \zeta)$ = the distribution of the errors of
    measurement for given true score, and

    $h(X \mid \zeta)$ = the conditional distribution of observed
    scores for given true score.

Recall that the binomial error model is a strong
true score model; i.e., it incorporates an assumption(s)
over and above the assumptions of the classical weak
true score model. The binomial error model does not,
therefore, violate any of these classical assumptions
A1 - A6; all these assumptions still hold.

Figure 2-7
The Conditional Distribution of Observed Scores
for Several Given True Scores under the
Binomial Error Model for a 20-Item Test
[Figure not reproduced. Abscissa: number of items correct (observed scores).]

Note. The above figure represents five different observed score
distributions determined by the parameter $\zeta$. This figure represents
the same information as that contained in Figure 2-6.

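In addition, however, for the binomial error model,
it is assumed that for a given true score, $\zeta$, the errors
of measurement, E, are independent and have a binomial
distribution with parameter $\zeta$; i.e.,

    $f(E \mid \zeta) = \binom{N}{E + N\zeta} \zeta^{E + N\zeta} (1 - \zeta)^{N - E - N\zeta}$,  $E = -N\zeta, 1 - N\zeta, \ldots, N - N\zeta$,
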
for given true score $\zeta$, where N is the number of test
items. This assumption can be stated as follows:
the conditional distribution of observed score X for
given true score $\zeta$ is the binomial distribution with
parameter $\zeta$; i.e.,

    $h(X \mid \zeta) = \binom{N}{X} \zeta^{X} (1 - \zeta)^{N - X}$,  $X = 0, 1, \ldots, N$.

The first of these two formulas is illustrated for
a 20-item test in Figure 2-6. It is instructive to
compare Figure 2-6 with Figures 2-1 and 2-4, which
illustrate the error assumptions for the classical model
and the "normal" model, respectively. The second of
these two formulas is illustrated in Figure 2-7.
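
The conditional distribution of observed scores under the binomial error model is easy to tabulate. The sketch below does so for a 20-item test; the function names and the two true score proportions shown (.5 and .9) are chosen here only for illustration and are not necessarily the values plotted in Figures 2-6 and 2-7.

```python
from math import comb

def binomial_conditional(n_items, zeta):
    """Return h(X | zeta) for X = 0, 1, ..., n_items, i.e. the binomial
    distribution of observed scores for true proportion correct zeta."""
    return [comb(n_items, x) * zeta**x * (1.0 - zeta)**(n_items - x)
            for x in range(n_items + 1)]

def mean_and_variance(pmf):
    """Mean and variance of a distribution over scores 0, 1, ..., len(pmf)-1."""
    m = sum(x * p for x, p in enumerate(pmf))
    v = sum((x - m) ** 2 * p for x, p in enumerate(pmf))
    return m, v

N_ITEMS = 20
for zeta in (0.5, 0.9):  # illustrative true score proportions
    pmf = binomial_conditional(N_ITEMS, zeta)
    m, v = mean_and_variance(pmf)
    print(f"zeta = {zeta}: mean = {m:.2f}, error variance = {v:.2f}")
```

The conditional mean equals the true score (N times the true proportion), so the errors remain unbiased, while the error variance shrinks as the true proportion approaches unity and no probability is assigned to unobtainable scores.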

Mathematically, assuming a linear regression of
true scores on observed scores, the above assumptions
imply that observed scores have a negative hypergeometric
(beta-binomial) distribution and true scores have a beta
distribution. Both
of these distributions can take on the negatively skewed
characteristic of many criterion-referenced and mastery
test score distributions.
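
A short numerical sketch may help make these distributions concrete. When true score proportions follow a beta distribution and errors are binomial, the observed scores follow the beta-binomial distribution (also called the negative hypergeometric), and the expected true proportion given an observed score is linear in that score. The beta parameters a = 8 and b = 2 and all function names below are hypothetical, chosen only to give a negatively skewed example.

```python
from math import comb, exp, lgamma

def log_beta(a, b):
    """Natural logarithm of the beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def beta_binomial_pmf(x, n_items, a, b):
    """P(X = x) when zeta ~ Beta(a, b) and X | zeta ~ Binomial(n_items, zeta)."""
    return comb(n_items, x) * exp(log_beta(x + a, n_items - x + b)
                                  - log_beta(a, b))

def expected_true_proportion(x, n_items, a, b):
    """E(zeta | X = x) under the same model; linear in x."""
    return (a + x) / (a + b + n_items)

N_ITEMS, A, B = 20, 8.0, 2.0  # hypothetical values for illustration
pmf = [beta_binomial_pmf(x, N_ITEMS, A, B) for x in range(N_ITEMS + 1)]
mean_x = sum(x * p for x, p in enumerate(pmf))
mode_x = max(range(N_ITEMS + 1), key=lambda x: pmf[x])

print(f"mean observed score = {mean_x:.2f}, modal observed score = {mode_x}")
for x in (10, 15, 20):
    print(f"E(zeta | X = {x}) = {expected_true_proportion(x, N_ITEMS, A, B):.3f}")
```

With these parameters most of the probability sits near the top of the score range with a long tail toward low scores, which is the shape often seen in mastery test data.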



Several times in the above discussions we have
assumed that the regression of true scores on observed
scores is linear. We also mentioned that this
assumption is not always true; however, Lord and Novick
(1968) claim that departures from linearity are
probably not too great, in most cases. This is one reason
we have retained the linearity assumption. Another
reason is that any non-linear assumption about the
regression of true on observed scores would necessitate
relatively complicated calculations in order to determine the
regression equation, the distribution of observed scores,
and the distribution of true scores. A third reason is
that it seems wise to consistently assume the linearity
of the regression of true on observed scores so that the
reader can more effectively compare the test theory
models discussed.

A fairly up-to-date and extensive discussion of
applications of the binomial error model (for settings
not necessarily related to criterion-referenced tests)
is given in Lord (1965). A further extension of the
binomial error model (called the "compound binomial
error model") is given in Lord and Novick (1968).

Summary and Discussion 

We have reviewed the classical test theory model 
and found it to be generally applicable to criterion- 
referenced testing with two reservations: (a) there is 
some doubt about the applicability of the model for the 
subset of persons who have the highest possible true 
score and (b) the model may be appropriate, but not 
sufficient, for criterion-referenced testing. 

Also, we have reviewed the implication of the 
classical test theory assumptions upon errors of 
measurement, and we have reviewed the normal and binomial 
error models. We find that, for criterion-referenced 
testing, if the classical model is to be used, then 
the binomial error model assumptions are more appro- 
priate than the normal error model assumptions, in most 
cases. However, we should note in passing, that it is 
considerably more difficult and time-consuming to use 
the binomial error model than to use the normal error 
model. 
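
The practical difference between the two error models can 
be illustrated with a minimal sketch. It is not taken from 
the sources reviewed above; it assumes the familiar 
linear-regression (Kelley) estimate of true score, the 
classical standard error of measurement, and Lord's 
binomial-error standard error for a number-correct score. 
The function names and the example values are hypothetical. 

```python
import math


def kelley_estimate(observed, group_mean, reliability):
    """Linear-regression estimate of a true score from an observed score."""
    return reliability * observed + (1.0 - reliability) * group_mean


def sem_normal(sd_observed, reliability):
    """Classical standard error of measurement; one value for all persons."""
    return sd_observed * math.sqrt(1.0 - reliability)


def sem_binomial(number_correct, n_items):
    """Binomial-error standard error for one person's number-correct score."""
    return math.sqrt(number_correct * (n_items - number_correct) / (n_items - 1))


# Hypothetical 20-item test: group mean 15, observed SD 3.0, reliability 0.80.
estimate_for_18 = kelley_estimate(18, 15, 0.80)
constant_sem = sem_normal(3.0, 0.80)
person_sems = {x: sem_binomial(x, 20) for x in (10, 18, 20)}
```

Under the normal model the standard error is the same for 
every person, whereas under the binomial model it depends 
upon the person's score and falls to zero at a perfect 
score. 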

In the context of this chapter, the normal and 
binomial error models provide us with alternative ways 
to estimate a person's true score given that person's 
observed score. Our discussion of error scores and 
their distribution is not necessarily appropriate for 
considering whether a student is above or below a 
mastery cutting score. Hambleton and Novick (1973) seem 
to consider this issue to be the crucial issue in 
criterion-referenced testing. It seems to me that whether 
or not a person is above or below a mastery cutting score 
is critical in mastery testing, but may not be critical 
for criterion-referenced testing, in general. Recall that 
a criterion-referenced test must have a "specified 
performance standard," but this "standard" does not 
necessarily require a mastery cutting score. 
In any case, whether one is dealing with mastery 
testing or its progenitor, criterion-referenced testing, 
it seems to this author that the estimation of a 
student's true score is a critical consideration. If 
one assumes the classical definition of true score, then 
the binomial error model seems appropriate; if one 
assumes a definition of true score that depends upon a 
mastery cutting score, then Hambleton and Novick's (1973) 
suggestions seem reasonable, but even their suggestions 
depend indirectly upon the classical definition of true 
score; and, finally if one assumes a different definition 
of true score, then different assumptions about errors 
may be necessitated. 

In conclusion, our discussion of the distributions 
of different kinds of scores may seem inconsistent 
with previous statements about the irrelevance of 
score distributions in criterion-referenced testing. 
To be more specific, as far as the interpretation of a 
set of criterion-referenced test scores is concerned, the 
distributions of observed and true scores over persons 
are irrelevant; however, assuming that one wants to 
estimate a given person's true score, one must make 
assumptions about the distribution of errors of measure- 
ment for that person, or, more specifically, one must 
make assumptions about the distribution of errors of 
measurement for all persons who have an observed score 
equal to the given person's observed score. Therefore, 
the distribution of errors of measurement is critical 
in criterion-referenced testing, even though the 
distributions of observed and true scores over persons are not 
relevant. It just so happens that a unique description 
of the distributions of true and observed scores is a 
by-product of assuming (a) the classical test theory 
model, (b) the linearity of the regression of true 
scores on observed scores, and (c) either the normal or 
the binomial error model. 
CHAPTER III 



Validity and Procedures for Constructing 
Criterion-Referenced Tests 



The most important aspect of any test is its 
validity; i.e., the extent to which the test measures 
what it is intended to measure. For criterion-referenced 
tests, most researchers view the question of 
validity primarily as a question of content validity 
(see Popham and Husek, 1969). For the most part, this 
author agrees with this view. However, our concern for 
content validity argues that we also consider the most 
important procedures involved in constructing criterion- 
referenced tests. It is these procedures that provide 
a basis for inferring the extent to which a test has 
content validity. 

For our purposes, let us consider five steps in the 
development of criterion-referenced tests: (a) the 
establishment of a domain of relevant behaviors, (b) the 
development of a procedure to generate items, (c) the 
development of an item sampling plan, (d) the development 
of a procedure to administer items, and (e) the collection 
of data and the revision of the test. In the following 
sections, we will treat important aspects of each of 
these issues and provide major references for the reader 
interested in more detail. Many of the issues discussed 
in this chapter and the next two chapters are also 
treated from a somewhat different point of view by 
Rovinelli and Hambleton, 1973. 

The Development of Criterion-Referenced Tests 

Domain of relevant behaviors. The first step in the 
development of a criterion-referenced test entails 
specifying and categorizing all of the behaviors which are to 
be tested. Operationally, this frequently means 
specifying and categorizing a set of objectives; thus, 
the task is analogous to constructing a blueprint for a 
norm-referenced test. A more specific approach to the 
task of establishing (and using) a domain of relevant 
behaviors is called "domain-referenced achievement 
testing"; Hively et al. (1973) provide an excellent 
statement of this model. 
Two questions usually arise when one attempts to 
specify the domain of relevant behaviors: (a) how 
extensive should the domain be? and (b) what is the 
nature of the domain? This author knows of no 
generally accepted procedure for defining the extent 
of the domain. Frequently, the extent of the domain 
corresponds with the extent of the subject matter to 
be covered in a certain course, in a particular 
segment of instruction, or in a particular time period. 
From a measurement point of view, it is probably 
advisable that a domain, or each unambiguously defined 
subset of the domain, contain or reference a set of 
objectives which is tested by a single criterion- 
referenced test. If this advice is followed, then, of 
course, the objectives in a domain, or each subset of 
the domain, should be closely related. When these condi- 
tions prevail, the items in a given criterion-referenced 
test will be testing similar objectives, and, therefore, 
the interpretability of a criterion-referenced test 
score will, in general, be enhanced. 

The nature of the domain may be considered as the 
way in which the objectives or elements of the domain 
are inter-related. When viewed in this manner, we can 
say that a domain can be characterized by: (a) no 
hierarchy, (b) a linear hierarchy, or (c) a complex 
hierarchy (i.e., a hierarchy having different branches). 
Now, one can postulate learning hierarchies (i.e., 
hierarchies indicating an optimum or desired order in 
which objectives should be taught to students in order 
to maximize learning) or knowledge hierarchies (i.e., 
hierarchies indicating which objectives are logical 
prerequisites to attaining other objectives). In these 
terms, the hierarchy in our "domain of relevant behaviors" 
is typically a knowledge hierarchy, which need not 
necessarily correspond to a learning hierarchy for the 
subject matter under consideration. It should be noted 
that it is not always necessary to specify a knowledge 
hierarchy even if one exists or can be postulated. 
However, a knowledge hierarchy is at least useful and 
often essential when one undertakes sophisticated 
procedures for item sampling and/or item administration 
(see discussion below). 

Procedures to generate items. At the present time, 
there are fundamentally two procedures for the generation 
of criterion-referenced test items: (a) have content 
specialists write items and (b) use item forms. 

In many areas of education, the item forms approach 
is not feasible, from a practical point of view, at this 
time. Consequently, one must have content specialists 
write test items in these areas. There are, however, 
problems in having content specialists write test items. 
In fact, most of the typical issues that surround con- 
tent validity emanate from a consideration of whether 
or not the items written by content specialists are 
unambiguous measures of intended objectives at the 
intended level of difficulty. It is especially 
difficult for content specialists to write "equivalent" test 
items for a given objective, and this is frequently a 
large part of the item writing task for 
criterion-referenced testing. 

During the last few years, an excellent theoretical 
foundation for item generation has been provided by 
literature on "item forms," a term originally introduced 
by Hively (1962). Item forms make it possible to define 
an entire class of items merely by substituting elements 
of replacement sets for variable elements in the item 
forms. There are at least two important advantages of 
this item generation technique: (a) item forms provide 
a concrete basis for generalization to a domain of 
content, thus providing a sound basis for examining con- 
tent validity, and (b) item forms allow for the possibility 
of generating a large number of equivalent items. In 
addition, Nitko (1970) argues that the analysis of a 
content area through the use of item forms provides a 
sound basis for the "systematic study of the domain of 
instructionally relevant tasks in terms of its structural 
and behavioral parameters (p. 10)." 
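
As a minimal sketch of this idea, an item form can be 
represented as a template containing variable elements 
together with a replacement set for each element. The 
template, the replacement sets, and the no-carrying 
constraint below are hypothetical illustrations of ours; 
they are not taken from the item forms of Hively, Ferguson, 
or the other authors cited in this section. 

```python
from itertools import product


def generate_items(template, replacement_sets, constraint=None):
    """Yield (item_text, substituted_values) for each admissible substitution."""
    names = list(replacement_sets)
    for values in product(*(replacement_sets[name] for name in names)):
        binding = dict(zip(names, values))
        if constraint is None or constraint(binding):
            yield template.format(**binding), binding


def no_carrying(binding):
    """Hypothetical constraint: single-digit addends whose sum is below 10."""
    return binding["a"] + binding["b"] < 10


# Hypothetical item form for single-digit addition without carrying.
addition_form = {
    "template": "{a} + {b} = ____",
    "replacement_sets": {"a": range(1, 10), "b": range(1, 10)},
}

items = list(
    generate_items(
        addition_form["template"],
        addition_form["replacement_sets"],
        constraint=no_carrying,
    )
)
```

Every item in the resulting class is defined by the item 
form itself, which is what makes the generalization to the 
domain concrete. 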

The literature indicates basically three approaches 
to the construction of item forms. Hively et al. (1968) 
and Ferguson (1969, 1971) use item forms primarily 
characterized by numerical replacement sets; Osburn (1968) 
uses item forms that employ both numerical and 
non-numerical replacement sets; and Bormuth (1970) argues 
for the use of item forms that incorporate linguistic 
transformational rules. All of the above researchers have, 
to varying degrees, treated the computer generation of test 
items through the application of item forms. Perhaps one 
of the best examples is provided by Ferguson (1969). 

Item sampling. In considering item sampling in the 
context of criterion-referenced tests, it is useful to 
recall that: (a) each objective has at least one and 
usually many "equivalent" test items associated with it 
and (b) the domain consists of a set of possibly inter- 
related objectives. Now in choosing items for a criterion- 
referenced test there are a number of possible sampling 
schemes that might be employed. For example, the test 
might consist of: (a) all items, (b) a random sample 
of items, (c) a stratified random sample of items where 
stratification occurs with regard to objectives, or 
(d) a representative sample of items. Kriewall (1969) 
and Lord and Novick (1968) provide a partial 
consideration of these sampling plans. 

A stratified random sample is perhaps the most 
common sampling method in criterion-referenced testing; 
however, the exact nature of the sampling plan and the 
sampling fractions are, unfortunately, seldom specified 
in detail. One reason that stratified random sampling 
is so popular is that it is practically ideally suited 
to the item forms approach to the generation of test 
items. Often, the item forms are the strata, and each 
item form provides a method of generating a set of items 
from which a random sample is drawn. 
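
A stratified random sampling plan of this kind might be 
sketched as follows. The stratum names, pool sizes, and 
numbers of items per stratum are hypothetical; the point of 
the sketch is only that the plan (and hence each sampling 
fraction) is stated explicitly. 

```python
import random


def stratified_item_sample(item_pools, items_per_stratum, seed=None):
    """Draw a simple random sample, without replacement, from each stratum."""
    rng = random.Random(seed)
    sample = {}
    for stratum, pool in item_pools.items():
        k = items_per_stratum[stratum]
        if k > len(pool):
            raise ValueError(f"stratum {stratum!r} has only {len(pool)} items")
        sample[stratum] = rng.sample(pool, k)
    return sample


# Hypothetical pools: one stratum per item form (objective).
pools = {
    "addition": [f"add-{i}" for i in range(1, 41)],
    "subtraction": [f"sub-{i}" for i in range(1, 21)],
}
plan = {"addition": 4, "subtraction": 2}

test_form = stratified_item_sample(pools, plan, seed=1)

# Sampling fraction for each stratum, which is seldom reported in practice.
sampling_fractions = {s: plan[s] / len(pools[s]) for s in pools}
```

Here both strata happen to have a sampling fraction of 
.10; unequal fractions are equally easy to specify. 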

Item administration. Often, a criterion-referenced 
test is, as are most norm-referenced tests, a fixed 
entity; i.e., for each person taking the test, the items 
are the same, and the order in which the items appear on 
the test is the same for all persons. Sometimes the 
order of administering items varies for each student, or 
for several sets of students. Less frequently, different 
students are given different items; in such cases, it 
also frequently occurs that different students receive 
different numbers of items which may even come from 
different strata. This last kind of item administration 
technique can be referred to generically as "adaptive 
testing." A specific kind of adaptive testing is called 
sequential testing, the statistical aspects of which are 
treated by Wald (1947). More recently, Kriewall and 
Hirsch (1969) consider sequential testing in the context 
of criterion-referenced testing. 

Adaptive testing (see Brennan, 1973) has been employed 
in both criterion-referenced and norm-referenced testing 
situations. This testing technique has been called 
"tailored testing" (Lord, 1971), "branched testing" 
(Ferguson, 1969, 1971), "programmed testing" (Linn et al., 
1969) and "sequential testing" (Linn et al., 1970). 
These types of tests have some elements in common; 
however, there are often differences among the ways 
these terms are used by particular researchers. Therefore, 
we will group all of the above testing techniques under 
the general heading of "adaptive testing," as distinct 
from "conventional testing" in which all items are adminis- 
tered to all examinees usually during a fixed-length time 
period. 
In general, current research on adaptive testing can 
be divided into two types: (a) adaptive testing for norm- 
referenced testing and (b) adaptive testing for criterion- 
referenced testing. Lord (1970, 1971) examines relevant 
issues in norm-referenced adaptive testing, while 
Ferguson's (1969, 1970) research treats 
criterion-referenced adaptive testing. Linn et al. (1969, 1970) 
incorporate aspects of both types of adaptive testing, although 
they seem to view the achievement testing process primarily 
from a norm-referenced viewpoint. 

Norm-referenced adaptive testing involves tailoring 
item difficulties to examinees in such a way that examinees 
spend most of their time answering items at or near their 
ability level. The objective is to determine an 
individual's relative position on some hypothetical continuum of 
underlying ability. Unfortunately, theoretical work by 
Lord (1970) indicates that the types of norm-referenced 
adaptive tests thus far examined have serious 
limitations: they do not provide "greatly improved 
measurements for most examinees. The value of (these) tests is 
primarily for those examinees for whom the conventional 
test would be too easy or too difficult (Lord, 1970, 
p. 153)." Thus, at this time, it appears that adaptive 
testing offers no significant advantages for the 
conventional types of norm-referenced tests. It is interesting 
to note, however, that most of the above research is of a 
theoretical nature; in practice, the computerized 
administration of such tests might yield significant 
improvements in reliability and validity per unit of testing 
time. 

In any case, testing within the context of instruction 
typically involves a different kind of measurement from 
that discussed by Lord (1970, 1971). In instruction, it 
seems more appropriate, in most cases, to employ criterion- 
referenced measurement instruments in such a way that 
decisions can be made concerning whether or not each 
student has achieved a desired level of proficiency. 

In the opinion of this author, the best example of 
criterion-referenced adaptive testing for instructional 
decision-making is provided by Ferguson (1969). He 
postulated a knowledge hierarchy for elementary addition 
and subtraction problems and developed a computerized 
system that employs the theory of item forms to generate 
a set of criterion-referenced test items for each of the 
nodes of the hierarchy. Then, using the sequential 
probability ratio test (Wald, 1947) as a primary basis 
for decision-making, he created an adaptive test which, 
"when compared to ... conventional tests (for determining 
proficiency in elementary addition and subtraction) ... 
seems comparable or superior in all respects (Ferguson, 
1969, p. 88)." Furthermore, his test was effective and 
efficient for determining the proficiency of all 
examinees, even those in the middle range of proficiency. 
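
The decision rule underlying such a test can be sketched 
with Wald's sequential probability ratio test for a 
binomial proportion. This is a generic illustration of the 
technique and not a reconstruction of Ferguson's system; 
the proficiency levels defining "non-master" and "master" 
and the two error rates are hypothetical values chosen by 
us. 

```python
import math


def sprt_mastery(responses, p_nonmaster=0.60, p_master=0.85,
                 alpha=0.10, beta=0.10):
    """Classify one examinee from a sequence of 0/1 item scores.

    alpha: tolerated probability of calling a non-master a master.
    beta:  tolerated probability of calling a master a non-master.
    Returns (decision, number_of_items_used).
    """
    upper = math.log((1.0 - beta) / alpha)   # accept "master" at or above
    lower = math.log(beta / (1.0 - alpha))   # accept "non-master" at or below
    log_ratio = 0.0
    for n, correct in enumerate(responses, start=1):
        if correct:
            log_ratio += math.log(p_master / p_nonmaster)
        else:
            log_ratio += math.log((1.0 - p_master) / (1.0 - p_nonmaster))
        if log_ratio >= upper:
            return "master", n
        if log_ratio <= lower:
            return "non-master", n
    return "undecided", len(responses)
```

With these hypothetical values, testing stops after seven 
consecutive correct responses or after three consecutive 
wrong responses; examinees with mixed response patterns 
receive more items before a decision is reached. 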

Thus, while the research findings for adaptive testing 
in the norm-referenced context are less than promising, 
the findings for criterion-referenced adaptive testing in 
the context of instructional decision-making are quite 
encouraging. 

Data collection and test revision. In a subsequent 
chapter, we discuss the role of empirical data in the 
revision of test items. Here, we merely outline several 
issues of general importance. 

Ideally one should collect and analyze data from 
all subjects who take the test in order to identify test 
items, and other aspects of the test, that require 
revision. If this is not feasible, one can analyze data 
from a random sample or representative sample of subjects; 
however, one must be careful to obtain a large enough 
sample so that item statistics are reasonably stable. 
The experience of this author indicates that the 
minimum sample size should be about 25-30 subjects, if 
at all possible. 
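
A rough sense of what "reasonably stable" means can be 
obtained from the standard error of an item difficulty 
index (the proportion p of subjects answering the item 
correctly). The calculation below assumes a simple random 
sample of n subjects with independent responses; this is 
our assumption for illustration, not a result from the 
sources cited in this chapter. 

```latex
\[
SE(p) = \sqrt{\frac{p(1-p)}{n}}, \qquad
SE(0.5)\Big|_{n=25} = \sqrt{\frac{0.25}{25}} = 0.10, \qquad
SE(0.5)\Big|_{n=30} = \sqrt{\frac{0.25}{30}} \approx 0.09
\]
```

Thus, with 25-30 subjects, an item difficulty near .50 is 
determined only to within roughly .10 (one standard error), 
which suggests treating this sample size as a minimum 
rather than as a target. 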

A second consideration is that empirical data should 
not form the sole basis for the revision of a criterion- 
referenced test. Data may indicate the potential need 
for revision; however, whether or not revision is 
actually undertaken should ultimately depend upon the 
judgment of subject matter specialists who have studied 
the data and weighed the trade-offs involved in revision. 

A third consideration is that the process of creating 
a good criterion-referenced test (or any test, for that 
matter) is a cyclic process of replication and revision. 
If the revision process employed is adequate, then each 
revision of the test should be an improvement of the 
previous version. This last statement may appear trivial; 
however, it should be noted that not all revision 
procedures necessarily result in improvement. 

Some Issues Concerning the Validity of Criterion-Referenced 
Tests 



We mentioned at the beginning of this chapter that 
content validity is a principal concern for 
criterion-referenced tests. If the procedures indicated above 
are followed carefully, then one has a reasonable 
expectation of obtaining a criterion-referenced test 
having content validity. Perhaps the most crucial 
issue is what Dahl (1971) calls "objective-item 
congruence," i.e., the extent to which the 
criterion-referenced items are appropriate measures of the 
objectives in the domain of intended behaviors. 

The critical nature of objective-item congruence 
provides, I think, an important argument in favor of 
the item forms approach to the generation of test items. 
The item form is usually an operational definition of the 
objective, and the item form provides a basis for 
generating the test items for the objective; hence, one 
has a strong logical basis for arguing that the test has 
content validity (in the sense of objective-item con- 
gruence) when one uses the item forms approach to the 
generation of test items. There is, however, one 
caution that should be noted concerning the use of item 
forms: the items resulting from a particular item form 
are not necessarily equivalent in a statistical sense. 
For example, it is not necessarily true that items 
generated from the same item form will all have the same 
(empirical) difficulty level. 

When subject matter specialists generate items, it 
becomes necessary to employ some judgmental procedures 
in order to assess the content validity of the 
criterion-referenced test. Such judgmental procedures usually 
entail assessing the extent to which subject matter 
specialists, working independently, agree that the test 
has objective-item congruence. There are a number of 
procedures for assessing agreement between or among 
judges. The reader may be interested in referring to 
Light (1973) for an excellent review of the literature 
in this area. Three potentially useful techniques, 
in the opinion of this author, have been discussed by 
Lu (1971), Hemphill and Westie (1950), and Brennan 
and Light (1973). 

Another issue relating to the question of validity 
involves the nature of the student scores on the test. 
A criterion-referenced test, by definition, necessitates 
"measurements that are directly interpretable in terms 
of specified performance standards (Glaser and Nitko, 
1971)." Therefore, the extent to which a 
criterion-referenced test is valid depends upon the extent to 
which scores reported on such a test are interpretable in 
terms of specified performance standards. For example, 
one may be concerned about the proportion of items, from 
a given domain, to which a student knows the answer. 
In this case, a student's score is valid to the extent 
that the observed proportion is free of random and 
systematic errors of measurement. Or one may be concerned 
about whether or not the true (in the sense of "actual") 
proportion of items to which the student knows the 
answer is above or below some mastery cutting score. 
In this case, students' scores are valid to the extent 
that both random and systematic errors of classification 
(of students above and below the mastery cutting score) 
are eliminated. 

The above observations point to a central relation- 
ship between reliability and validity for criterion- 
referenced tests (or any test, for that matter). This 
relationship may be stated as follows: a test is 
reliable to the extent that scores resulting from it are 
free of random errors of measurement; a test is valid to 
the extent that scores resulting from it are free of 
both random and systematic errors of measurement. 
CHAPTER IV 



Reliability of Criterion-Referenced Test Scores 

The most frequently considered statistical issue 
surrounding criterion-referenced measurement involves 
the reliability of such measures. In this chapter, 
we review fundamental ideas about reliability, we consider 
several problems in employing norm-referenced reliability 
measures for criterion-referenced tests, and we provide 
a critical review of most of the reliability measures 
that have been suggested for criterion-referenced tests. 

Classical Notions about Reliability 

In the classical test theory model, reliability is 
defined as either (a) the squared correlation between 
true scores and observed scores or (b) the ratio of the 
variance of true scores to the variance of observed 
scores. (Gulliksen, 1950, and Lord and Novick, 1968, 
treat the theory of reliability in considerable detail.) 
Neither of these two theoretical definitions of relia- 
bility can be applied directly since they involve the 
unobservable true scores discussed in Chapter II. 

However, under the classical test theory model it 
can be shown that reliability is also equal to the 
correlation between parallel tests, where parallel tests are 
defined statistically as tests that have equal means, 
equal variances, and equal intercorrelations (if there 
are more than two tests involved). Therefore, one method 
of determining reliability is to obtain the correlation 
over persons on parallel tests; this is called a measure 
of equivalence. 

Another measure of reliability is called a measure 
of stability, which is the correlation over persons of 
two separate administrations of the same test, with the 
assumption that no learning occurs between the first and 
second administrations. 

A third measure of reliability is called internal 
consistency. Typical measures of internal consistency 
are Kuder and Richardson's (1937) Formulas 20 and 21, 
Cronbach's Coefficient Alpha (1951), and Hoyt's (1941) 
Reliability Coefficient. For the classical correct/ 
wrong scoring procedure these coefficients (with the 
exception of Formula 21) all provide identical results. 
Other kinds of internal consistency measures include 
measures of homogeneity and split-halves coefficients. 
Some authors consider measures of internal consistency 
as different from measures of reliability (e.g., Brown, 
1970); most authors, however, treat measures of internal 
consistency as a special kind of measure of equivalence. 
A critical point to recognize is that measures of 
internal consistency employ only one administration of 
one test. 
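
For items scored correct/wrong, Kuder-Richardson Formulas 
20 and 21 can be computed from a single administration as 
in the following minimal sketch. The function names and 
the small data set are hypothetical, and the use of 
population (divide-by-N) variances is an assumption of the 
sketch. 

```python
from statistics import mean, pvariance


def kr20(scores):
    """KR-20 for a persons-by-items matrix of 0/1 scores.

    Returns None when total-score variance is zero (coefficient undefined).
    """
    k = len(scores[0])
    totals = [sum(row) for row in scores]
    var_total = pvariance(totals)
    if var_total == 0:
        return None
    difficulties = [mean([row[j] for row in scores]) for j in range(k)]
    sum_pq = sum(p * (1 - p) for p in difficulties)
    return (k / (k - 1)) * (1 - sum_pq / var_total)


def kr21(scores):
    """KR-21, which uses only the mean and variance of total scores."""
    k = len(scores[0])
    totals = [sum(row) for row in scores]
    var_total = pvariance(totals)
    if var_total == 0:
        return None
    m = mean(totals)
    return (k / (k - 1)) * (1 - m * (k - m) / (k * var_total))


# Hypothetical data: four persons, three items.
data = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]]
all_perfect = [[1, 1, 1]] * 4
```

For these data KR-20 is .75 and KR-21 is .60. When every 
person obtains a perfect score, the total-score variance is 
zero and neither coefficient can be computed, which is why 
the sketch returns None in that case. 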

A measure of reliability in and of itself is essen- 
tially a statistic characterizing the extent to which a 
test is a dependable measurement instrument. However, 
indirectly a reliability coefficient provides a basis for 
making inferential statements about true scores and 
observed scores. (See discussion of errors of measurement 
in Chapter II.) 

Also, although we usually consider reliability as a 
measure involving a test of fixed length, one can use the 
Spearman-Brown Prophecy Formula to estimate the reliability 
of a test of any length. An important special case is the 
reliability of a one-item test, which is mathematically 
equal to the intraclass correlation coefficient for the 
test of full length. The reader interested in new and 
important developments concerning these and other related 
issues should consult Cronbach et al. (1972). 
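
The Spearman-Brown calculation can be sketched as follows, 
where the length factor is the ratio of the new test 
length to the original test length. The function name and 
the example values are hypothetical. 

```python
def spearman_brown(reliability, length_factor):
    """Projected reliability when test length is multiplied by length_factor."""
    return (length_factor * reliability) / (
        1.0 + (length_factor - 1.0) * reliability
    )


# Doubling a test whose reliability is .50.
doubled = spearman_brown(0.50, 2)

# Reliability of a one-item test implied by a three-item test
# whose reliability is .75 (length factor 1/3).
one_item = spearman_brown(0.75, 1 / 3)
```

A length factor smaller than one gives the reliability of 
a shortened test; a factor of 1/K for a K-item test gives 
the one-item case mentioned above. 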

Another important issue in reliability theory involves 
the reliability of change scores. (See, for example, 
Harris, 1963; Tucker et al., 1966; and Cronbach et al., 1970.) 
For example, one typically judges the effectiveness of an 
instructional system in terms of pretest-posttest changes 
in student performance. Therefore, in order to judge the 
effectiveness of an instructional system one needs to know 
the reliability of these change scores. This is a very 
complicated issue and one that has not received a great 
deal of treatment from a criterion-referenced testing 
viewpoint. 



Problems in Using Norm-Referenced Reliability Indices 
for Criterion-Referenced Tests 

A few years ago Popham and Husek (1969) stated: 

"... it is obvious that a criterion-referenced test 
should be internally consistent. If we argue that 
the items are tied to a criterion, then certainly 
the items should be quite similar in terms of what 
they are measuring. But although it may be obvious 
that a criterion-referenced test should be 
internally consistent, it is not obvious how to assess the 
internal consistency. The classical procedures 
are not appropriate. This is true because they are 
dependent upon score variability. A criterion- 
referenced test should not be faulted if, when 
administered after instruction, everyone obtained 
a perfect score. Yet, that would lead to a zero 
internal consistency estimate, something measurement 
books don't recommend. 

In fact, even stranger things can happen in 
practice. It is possible for a criterion-referenced 
test to have a negative internal consistency index 
-- and still be a good test. ... 

Other aspects of reliability are equally cloudy. 
Stability might certainly be important for a 
criterion-referenced test, but in that case, a 
test-retest correlation coefficient, dependent as it is 
on variability, is not necessarily the way to assess 
it. Some kind of confidence interval around the 
individual score is perhaps a partial solution to 
this problem. 

The reader should not misinterpret the above 
statements. If a criterion-referenced test has a 
high average inter-item correlation, this is fine. 
If the test has a high test-retest correlation, that 
is also fine. The point is not that these indices 
cannot be used to support the consistency of the test. 
The point is that a criterion-referenced test could 
be highly consistent, either internally or 
temporally, and yet indices dependent upon variability 
might not reflect that consistency. (pp. 5-6)" 

Clearly, the major issue that Popham and Husek 
consider is the very real possibility that a set of 
criterion-referenced test scores may not display much 
variance. In this case, the classical measures of 
reliability are apt to be inappropriate. This is perhaps the 
most frequently cited reason for the need to develop new 
measures of reliability for criterion-referenced tests. 

Another frequently cited reason for developing new 
indices is that criterion-referenced tests frequently 
employ a mastery cutting score that is intended to be 
independent of the distribution of observed (and true) 
scores. The presence of this cutting score argues that 
an important issue in the reliability of mastery tests 
involves the extent to which the test is a dependable 
instrument for assessing whether or not persons surpass 
the mastery cutting score. 
These are the two most frequently cited reasons for 
pursuing the development of new measures of reliability 
for criterion-referenced tests. Our discussion of parti- 
cular indices in the next section will build upon and, in 
some cases, further refine these reasons. 



Criterion-Referenced Reliability Indices 

The literature contains a number of suggested 
statistics for estimating the reliability of 
criterion-referenced test scores. In this section we describe most of 
these indices and, when appropriate, we comment on their 
characteristics, strengths, and weaknesses. 

Ivens' agreement indices. Ivens (1970) argues that 
measures of reliability for criterion-referenced tests 
should be independent of test score variance; therefore, 
the measures he proposes are based upon a consideration 
of different kinds of agreement. 

First, Ivens considers reliability using the concept 
of within-subject equivalence of total scores. "For each 
subject, the raw score for the two administrations, 
either test-retest or parallel forms, (is) converted into 
percent-correct scores. For each examinee, the absolute 
difference between the percent correct on the two 
administrations (is) obtained. ... The actual reliability index 
(consists) of reporting ... the percent of subjects with 
percent-difference scores of a given size or less (Ivens, 
1970, p. 11)." This measure can be expressed 
algebraically as follows. Let 

$X_{ijl}$ = the response of person $i$ ($i = 1, 2, \ldots, N$) 
to item $j$ ($j = 1, 2, \ldots, K$) 
on test $l$ ($l = 1, 2$), where 

$l = 1$ means the first administration of the 
test (or the first of the two parallel 
tests) and 

$l = 2$ means the second administration of the 
test (or the second of the two parallel 
tests). 

Writing $\bar{X}_{i \cdot l} = (1/K) \sum_j X_{ijl}$ for the 
proportion of items answered correctly by person $i$ on 
test $l$, let 

$$
A_i =
\begin{cases}
1 & \text{if } |\bar{X}_{i \cdot 1} - \bar{X}_{i \cdot 2}| \le c \\
0 & \text{if } |\bar{X}_{i \cdot 1} - \bar{X}_{i \cdot 2}| > c
\end{cases}
$$

where $c$ is some tolerance limit in the 
range $0 \le c \le 1.0$. 
Now, the reliability (in the sense of agreement) over 
persons, given a tolerance limit of c, is: 

$$
AP(c) = \frac{1}{N} \sum_{i} A_i
$$


Note that AP(c) is a proportion, and there are as many 
possibly different values of AP(c) as there are values of 
c. If we plotted AP(c) against c, then we would observe 
that AP(c) is a monotonically non-decreasing function of 
c. Thus, in order to report AP(c) in its entirety, we 
should report something like a plot of AP(c) for 
0 <= c <= 1.0. If this procedure is not followed, then, 
at a minimum, one could report several selected values 
of AP(c). 
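
A minimal sketch of the calculation of AP(c) follows. The 
function name and the data (four persons, four items, two 
administrations) are hypothetical. 

```python
def ivens_ap(test1, test2, c):
    """Proportion of persons whose proportion-correct scores on the two
    administrations differ by c or less.

    test1, test2: one list of 0/1 item scores per person, in the same
    person order for both administrations.
    """
    agreements = 0
    for first, second in zip(test1, test2):
        p_first = sum(first) / len(first)
        p_second = sum(second) / len(second)
        if abs(p_first - p_second) <= c:
            agreements += 1
    return agreements / len(test1)


# Hypothetical responses of four persons to four items.
first_admin = [[1, 1, 1, 1], [1, 1, 0, 0], [1, 0, 0, 0], [0, 0, 0, 0]]
second_admin = [[1, 1, 1, 0], [1, 1, 0, 0], [1, 1, 1, 0], [1, 0, 0, 0]]

# Several selected values of AP(c), as suggested in the text.
ap_values = {c: ivens_ap(first_admin, second_admin, c) for c in (0.0, 0.25, 0.5)}
```

For these data AP(0) = .25, AP(.25) = .75, and AP(.50) = 
1.00, which illustrates that AP(c) cannot decrease as c 
increases. 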

Second, Ivens considers test reliability as the 
average of the individual item reliabilities where item 
reliability is expressed by calculating the proportion 
of subjects whose item scores (pass-fail or correct- 
wrong) are the same on the test and the retest, or on the 
test and the parallel form. Using this line of reasoning 
the reliability for item j is defined as: 



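$$
AI_j = \frac{1}{N} \sum_{i} A_{ij}, \quad \text{where} \quad
A_{ij} =
\begin{cases}
1 & \text{if } X_{ij1} = X_{ij2} \\
0 & \text{if } X_{ij1} \ne X_{ij2}
\end{cases}
$$

Thus, test reliability is defined as: 

$$
\begin{aligned}
AI &= \frac{1}{K} \sum_{j} AI_j \\
   &= \frac{1}{NK} \sum_{i} \sum_{j} A_{ij} \\
   &= \frac{1}{N} \sum_{i} \left[ \frac{1}{K} \sum_{j} A_{ij} \right]
\end{aligned}
$$
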

The measure AI is very appealing in that it is a linear 
function of item reliabilities. Thus, for example, if 
one knows the reliability of each of the items in an item 
bank, then one can estimate the reliability for any test 
(i.e., for any subset of items that might be selected from 
the item bank). 
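
The item-level index and its use with an item bank can be 
sketched as follows. The function names and the data 
(four persons, three items) are hypothetical. 

```python
def ivens_item_agreement(test1, test2):
    """AI_j for each item: proportion of persons with the same item score
    on both administrations."""
    n_persons = len(test1)
    n_items = len(test1[0])
    return [
        sum(test1[i][j] == test2[i][j] for i in range(n_persons)) / n_persons
        for j in range(n_items)
    ]


def ivens_ai(item_agreements, item_subset=None):
    """AI for a test: the mean of AI_j over the items selected for the test."""
    if item_subset is None:
        item_subset = range(len(item_agreements))
    return sum(item_agreements[j] for j in item_subset) / len(item_subset)


# Hypothetical responses of four persons to a three-item bank.
first_admin = [[1, 1, 0], [1, 0, 0], [1, 1, 1], [0, 0, 0]]
second_admin = [[1, 1, 1], [1, 0, 0], [1, 0, 1], [0, 0, 1]]

bank = ivens_item_agreement(first_admin, second_admin)
ai_full_bank = ivens_ai(bank)
ai_first_two_items = ivens_ai(bank, item_subset=[0, 1])
```

Because AI is simply the mean of the item indices, the 
value for any subset of items follows directly from the 
stored item values, without rescoring the original 
responses. 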
Ivens' measures have several appealing characteristics. 
First, they are distribution-free. Second, they do not 
depend upon test score variance. Third, they are simple 
to calculate. Fourth, they can be used to calculate 
measures of stability or equivalence. Fifth, they are 
relatively easy to interpret. 

From a different point of view, we note that these 
measures have certain characteristics that some may 
consider undesirable. First, they are not interpretable 
in terms of the ratio of true score variance to observed 
score variance; therefore, they are not measures of 
reliability under the classical test theory model. Second, 
Ivens' indices are not likely to provide a great deal of 
help to the researcher interested in estimating a person's
true score, which is, indirectly, a typical function of a
reliability index under the classical test theory model.

Berger-Carver mastery agreement. Berger (1970) and
Carver (1970) consider a method similar to Ivens'
agreement indices for assessing the reliability of a
criterion-referenced test. In the Berger-Carver case,
however, a subject's score is treated as a dichotomous
variable; i.e., a subject is placed into a mastery or a
non-mastery group depending upon whether the subject
surpassed or failed to surpass some minimum performance
level, or mastery score. On a test-retest or parallel
forms basis, a subject's two scores constitute an agreement
if they result in the same classification; and the
reliability measure is the agreement proportion over
subjects. Letting

A = the number of subjects who scored above the 
mastery cut-off on both the test and retest 
(or both parallel forms), 

B = the number of subjects who scored below the
mastery cut-off on both the test and the
retest (or both parallel forms), and

N = the total number of subjects,

the Berger-Carver mastery agreement (reliability) measure 
can be expressed as: 

BC-MA = (A + B)/N 

A major conceptual difference between the BC-MA
statistic and Ivens' indices is that the BC-MA statistic
involves a mastery cut-off, while Ivens' indices do not.
Another difference is that, for the BC-MA statistic, a
student's total test score functions as an intermediate
score, intermediate to scoring the student 1 (master)
or 0 (non-master). In most other respects, the advantages
and disadvantages noted for Ivens' measures apply
to the BC-MA statistic as well.
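
A minimal Python sketch of the BC-MA computation follows. It assumes
paired total scores for the same subjects, and it treats a score at
or above the cut-off as mastery; how a score exactly at the cut-off
is classified is an assumption made here for illustration. The
function name and data are hypothetical.

```python
def bc_mastery_agreement(scores_1, scores_2, cutoff):
    """BC-MA = (A + B) / N for paired total scores and a mastery cut-off.

    Assumption: a score at or above `cutoff` counts as mastery.
    """
    if len(scores_1) != len(scores_2) or not scores_1:
        raise ValueError("need two equally long, non-empty score lists")
    pairs = list(zip(scores_1, scores_2))
    # A: masters on both occasions; B: non-masters on both occasions.
    both_master = sum(1 for x1, x2 in pairs if x1 >= cutoff and x2 >= cutoff)
    both_non_master = sum(1 for x1, x2 in pairs if x1 < cutoff and x2 < cutoff)
    return (both_master + both_non_master) / len(pairs)


# Hypothetical data: six subjects, 10-item test, mastery score of 7.
test = [8, 9, 4, 6, 10, 3]
retest = [9, 7, 5, 8, 10, 6]
print("BC-MA =", bc_mastery_agreement(test, retest, cutoff=7))
```

In these data only the fourth subject changes classification, so
five of the six pairs of scores count as agreements.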

Marshall's index of separation. Marshall (1973)
proposed an index based upon the assumption that the
population taking a criterion-referenced test is the union
of two subpopulations, either of which may be empty. For
the "knowledgeable" subpopulation, the expected value of
a person's score is assumed to be equal to the number of
items in the test; for the "not knowledgeable" subpopulation,
the expected value of a person's score is assumed
to be equal to zero. Marshall's index of separation is
defined as:



$$\mathrm{SEP} = 1.0 - \frac{4}{nN} \sum_{i=1}^{N} \left( X_i - \frac{X_i^2}{n} \right), \quad \text{where}$$

$n$ = the number of test items,
$N$ = the number of persons, and
$X_i$ = a person's total score (number of items correct)
on the test.

This index has a range of zero to one, and it is related 
to the variance of the total scores by the formula: 



$$\mathrm{SEP} = 1.0 - 4\left[ pq - \frac{N-1}{n^2 N} s^2 \right], \quad \text{where}$$

$p$ = mean proportion correct over items (i.e., mean
item difficulty) and
$q = 1 - p$.



Marshall notes that the index of separation stays constant
at 1.0 when (a) total scores are all zero, (b) total scores
are all n, or (c) total scores are half zero and half n.
Thus, Marshall's index of separation obviates one of the
objections to classical reliability indices for criterion-referenced
tests; namely, the classical formulas give
a reliability of zero when the variance of total scores
is zero.
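
The two expressions for the index of separation can be checked
against each other numerically. The Python sketch below is offered
only as an illustration; it assumes that the variance in the second
formula is the usual unbiased sample variance of the total scores
(divisor N - 1), which is what the factor (N - 1) suggests. The
function names and data are hypothetical.

```python
from statistics import variance


def sep_direct(totals, n_items):
    """SEP = 1 - (4 / (nN)) * sum(X_i - X_i^2 / n)."""
    n_persons = len(totals)
    return 1.0 - (4.0 / (n_items * n_persons)) * sum(
        x - x * x / n_items for x in totals
    )


def sep_from_variance(totals, n_items):
    """SEP = 1 - 4 * [pq - ((N - 1) / (n^2 N)) * s^2]."""
    n_persons = len(totals)
    p = sum(totals) / (n_items * n_persons)  # mean proportion correct
    q = 1.0 - p
    # Assumption: s^2 is the unbiased sample variance of total scores.
    s2 = variance(totals) if n_persons > 1 else 0.0
    return 1.0 - 4.0 * (p * q - ((n_persons - 1) / (n_items ** 2 * n_persons)) * s2)


n = 10  # hypothetical 10-item test
cases = {
    "all zero": [0, 0, 0, 0],
    "all n": [10, 10, 10, 10],
    "half zero, half n": [0, 0, 10, 10],
    "all at n/2": [5, 5, 5, 5],
    "mixed": [3, 7, 10, 0, 5, 8],
}
for label, totals in cases.items():
    print(label, sep_direct(totals, n), sep_from_variance(totals, n))
```

The first three cases give an index of 1.0 even though two of them
have zero score variance; the index falls to zero when every person
scores exactly half of the items correct.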
Harris' index of efficiency. Harris (1972a) proposed
an index of efficiency defined as:

$$\mathrm{EFF} = \frac{SS_B}{SS_B + SS_W}$$

where the between-groups and the within-groups sums of 
squares are determined by the two groups resulting from 
dichotomizing subjects into masters and non-masters. 
Harris states that the purpose of his index is "to measure 
how well the test sorts defined samples of students into 
(mastery and non-mastery) categories and possibly to 
measure its efficiency in this sense (Harris, 1972a, p. ^)." 
Harris points out that EFF can be viewed as the ratio of 
true score variance to observed score variance if a 
subject's true score is defined as the mean of that 
subject's group (mastery group or non-mastery group). 
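
A minimal sketch of this index in Python is given below. It assumes
that the index is the between-groups sum of squares divided by the
total (between plus within) sum of squares, which is consistent with
the variance-ratio interpretation just described, and that subjects
scoring at or above the cut-off are classified as masters. The
function name, the cut-off rule, and the data are hypothetical.

```python
def efficiency_index(scores, cutoff):
    """EFF = SS_between / (SS_between + SS_within) for masters vs. non-masters.

    Assumption: a score at or above `cutoff` places a subject in the
    mastery group.
    """
    masters = [x for x in scores if x >= cutoff]
    non_masters = [x for x in scores if x < cutoff]
    grand_mean = sum(scores) / len(scores)
    ss_between = 0.0
    ss_within = 0.0
    for group in (masters, non_masters):
        if not group:
            continue  # an empty group contributes nothing
        group_mean = sum(group) / len(group)
        ss_between += len(group) * (group_mean - grand_mean) ** 2
        ss_within += sum((x - group_mean) ** 2 for x in group)
    total = ss_between + ss_within
    if total == 0:
        raise ValueError("all scores are identical; the index is undefined")
    return ss_between / total


# Hypothetical data: six subjects, mastery score of 7.
scores = [2, 3, 4, 8, 9, 10]
print("EFF =", efficiency_index(scores, cutoff=7))
```

Note that the index is undefined when all scores are identical, so
it shares the dependence on score variance noted for other indices.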

Hambleton-Novick indices. Hambleton and Novick (1973)
state that "in most cases, the pertinent question (in
criterion-referenced testing) is whether or not the individual
examinee has attained some specified degree of
competence on an instructional performance task (p. 160)."
Hambleton and Novick interpret the "specified degree of
competence" as a mastery score.

In order to consider the reliability index they
propose, we must first review their decision-theoretic
approach to criterion-referenced measurement. This approach
is considerably different from other approaches
reported in the literature, and, in the opinion of this
author, the decision-theoretic approach has much to
recommend it, at least from a theoretical viewpoint for
mastery testing. Their approach is similar, in some
respects, to the "quota-free" selection problem discussed
in Cronbach and Gleser (1965). "That is, there is no quota
on the number of individuals who can exceed the cut-off
scores or threshold on a criterion-referenced test
(Hambleton and Novick, 1973, p. 163)." Again, it should
be noted that Hambleton and Novick are using the term 
"criterion-referenced test" in the sense of "mastery 
test." 

Quoting from Hambleton and Novick (1973):

"The primary problem in the new instructional
models, such as individually prescribed instruction,
is the one of determining if $\pi_i$, the student's
true mastery level, is greater than a specified
standard $\pi_o$. Here, $\pi_i$ is the "true" score for an
individual $i$ in some particular well specified
content domain. It may represent the proportion of
items in the domain he could answer successfully.
Since we cannot administer all items in the domain,
we sample some small number to obtain an estimate
of $\pi_i$, represented as $\hat{\pi}_i$. The value of $\pi_o$ is the
somewhat arbitrary threshold score
used to divide individuals into the two categories
described earlier, i.e., Masters and Non-masters.

Basically then, the examiner's problem is to
locate each examinee in the correct category. There
are two kinds of errors that occur in this
classification problem: False positives and false
negatives. A false-positive error occurs when the
examiner estimates an examinee's ability to be above
the cutting score when, in fact, it is not. A false-negative
error occurs when the examiner estimates an
examinee's ability to be below the cutting score when
the reverse is true. The seriousness of making a
false-positive error depends to some extent on the
structure of the instructional objectives. It would
seem that this kind of error has the most serious
effect on program efficiency when the instructional
objectives are hierarchical in nature. On the other
hand, the seriousness of making a false-negative
error would seem to depend on the length of time a
student would be assigned to a remedial program
because of his low test performance. (Other factors
would be the cost of materials, teacher time, facilities,
etc.) The minimization of expected loss would
then depend, in the usual way, on the specified losses
and the probabilities of incorrect classification.
This is then a straightforward exercise in the minimization
of what we would call threshold loss.

In an attempt to view the above discussion in a
more formal manner, suppose we take some criterion
level $\pi_o$, and define a parameter $\omega$ such that

$$\omega = 1 \text{ if } \pi \ge \pi_o, \qquad \omega = 0 \text{ if } \pi < \pi_o.$$

Persons having $\omega$ values of one are those who
have true ability levels equal to or greater than
the criterion level $\pi_o$, and those having $\omega$ values of
zero are those whose $\pi$ values are below $\pi_o$.
Now if we obtain an estimate of $\pi_i$, then an estimate
of $\omega$ would be obtained in the following way:
$$\hat{\omega} = 1, \text{ if } \hat{\pi}_i \ge \pi_o, \text{ and} \qquad \hat{\omega} = 0, \text{ if } \hat{\pi}_i < \pi_o.$$

Defining our error of estimation as $(\hat{\omega} - \omega)$, the
difference between the estimated and the true value,
it is clear that the error takes on one of three
values: +1, -1, 0, corresponding to whether we make
a false-positive error, a false-negative error, or
a correct classification. Also, note that the
squares of the errors and their absolute values are
identical. Thus, any procedure that minimizes
squared-error loss (SEL) in the $\omega$ metric also minimizes
absolute-error loss (AEL) in that metric. The
criterion-referenced measurement problem is, thus,
one of determining an estimator of $\omega$ by determining
an estimator $\hat{\pi}$ of $\pi$ with a threshold loss function
and converting this to an estimate of $\omega$. . . . Note
that with threshold loss, the estimate $\hat{\pi}$ of $\pi$ is not
a single number but one of two intervals $[0, \pi_o)$ or
$[\pi_o, 1]$. ... The minimization of SEL and AEL
in the $\omega$ metric is equivalent to the minimization of
threshold loss for $\pi$ in the special case where the
losses associated with false positives and false
negatives are equal (pp. 163-164)."
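
The classification errors described in the quotation can be
illustrated with a short Python sketch. It is not Hambleton and
Novick's estimation procedure; it merely assumes that true and
estimated domain scores are available (the true scores would be
unknown in practice) and shows the three possible error values and
the equivalence of the loss functions when the two losses are equal.
All names and data are hypothetical.

```python
def classify(pi_value, pi_o):
    """Return 1 (master) when the value is at or above the criterion, else 0."""
    return 1 if pi_value >= pi_o else 0


def classification_errors(true_pi, est_pi, pi_o):
    """Estimated minus true classification for each person.

    +1 is a false positive, -1 a false negative, 0 a correct decision.
    """
    return [classify(e, pi_o) - classify(t, pi_o) for t, e in zip(true_pi, est_pi)]


def mean_threshold_loss(errors, loss_fp=1.0, loss_fn=1.0):
    """Average loss when false positives and false negatives carry given losses."""
    total = sum(loss_fp if e == 1 else loss_fn if e == -1 else 0.0 for e in errors)
    return total / len(errors)


# Hypothetical true and estimated domain scores; criterion level of 0.70.
true_pi = [0.85, 0.60, 0.75, 0.90, 0.40]
est_pi = [0.80, 0.72, 0.65, 0.95, 0.30]
errors = classification_errors(true_pi, est_pi, pi_o=0.70)

sel = sum(e ** 2 for e in errors) / len(errors)  # squared-error loss
ael = sum(abs(e) for e in errors) / len(errors)  # absolute-error loss
print("errors:", errors)
print("SEL:", sel, "AEL:", ael, "threshold loss:", mean_threshold_loss(errors))
```

When the two losses are unequal (for example, `loss_fp=2.0`), the
threshold loss no longer coincides with SEL and AEL, which is the
special-case restriction noted at the end of the quotation.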

In order to make use of the procedure indicated above,
one must obtain estimates for the $\pi_i$. In order to accomplish
this, Hambleton and Novick suggest a Bayesian solution
that involves using the "direct information provided by the
student's ... score (and) the collateral information contained
in the test data of other students. (Another possibility
and one worthy of future research is that of using
the student's other subscale scores and previous history
as collateral information.) (p. 165)."

Using the above approach, Hambleton and Novick suggest
two reliability coefficients. First, assuming the existence
of two tests that are parallel in the $\omega$ metric, let

$\hat{\omega}_{1i}$ = score for person $i$ on first parallel test and

$\hat{\omega}_{2i}$ = score for person $i$ on second parallel test.

Then, one possible reliability coefficient is the correlation,
over persons, for the two parallel tests; i.e.,

$$\text{HN-CORR} = \mathrm{corr}(\hat{\omega}_{1i}, \hat{\omega}_{2i}).$$
Another measure of reliability they suggest is the proportion
of times that the same decision is made with the two
parallel measurements; i.e.,

$$\text{HN-MA} = A/N, \quad \text{where}$$

$A$ = number of times $\hat{\omega}_{1i} = \hat{\omega}_{2i}$, and

$N$ = total number of subjects.

Clearly, these two measures are measures of equivalence;
analogous measures of stability can be constructed
directly.

The critical problem in the Hambleton and Novick
procedure involves the estimation of the $\pi_i$ scores, which
are the "true" scores for the individuals. ("True score"
is never defined by Hambleton and Novick; therefore, we
assume here that they mean true score in the classical
sense.) It should be noted that the authors' suggested
Bayesian solution to the problem is, at the present time,
an unsolved problem, since the most appropriate available
procedure (Novick, Lewis, and Jackson, 1973) does not use
a threshold loss function, according to Hambleton and Novick.
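
Given the 0/1 mastery decisions from two parallel tests, the two
Hambleton-Novick coefficients can be sketched as follows. The sketch
takes the decisions as given and says nothing about how the
underlying domain scores are estimated; the function names and data
are hypothetical.

```python
from math import sqrt


def hn_mastery_agreement(dec_1, dec_2):
    """HN-MA: proportion of persons receiving the same decision twice."""
    return sum(1 for a, b in zip(dec_1, dec_2) if a == b) / len(dec_1)


def hn_correlation(dec_1, dec_2):
    """HN-CORR: Pearson correlation between two 0/1 decision vectors."""
    n = len(dec_1)
    mean_1 = sum(dec_1) / n
    mean_2 = sum(dec_2) / n
    cov = sum((a - mean_1) * (b - mean_2) for a, b in zip(dec_1, dec_2))
    ss_1 = sum((a - mean_1) ** 2 for a in dec_1)
    ss_2 = sum((b - mean_2) ** 2 for b in dec_2)
    if ss_1 == 0 or ss_2 == 0:
        # Everyone received the same decision on at least one test.
        raise ValueError("correlation is undefined without variation in decisions")
    return cov / sqrt(ss_1 * ss_2)


# Hypothetical decisions for eight persons (1 = master, 0 = non-master).
dec_1 = [1, 1, 0, 1, 0, 0, 1, 1]
dec_2 = [1, 0, 0, 1, 0, 1, 1, 1]
print("HN-MA   =", hn_mastery_agreement(dec_1, dec_2))
print("HN-CORR =", hn_correlation(dec_1, dec_2))
```

The correlation is undefined when every person receives the same
decision on either test, whereas the agreement proportion remains
well defined in that case.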

Livingston's coefficient. Livingston's (1972b) coefficient
has undoubtedly received more attention (and criticism)
than any other criterion-referenced reliability
measure that has been reported in the literature. See,
for example, Livingston (1972a, b, c); Harris (1972b, 1973);
and Shavelson et al. (1972). Livingston's reliability
coefficient can be expressed as:

$$\mathrm{LIV} = \frac{r_{xx} V(X) + (\bar{X} - C)^2}{V(X) + (\bar{X} - C)^2}, \quad \text{where}$$

$r_{xx}$ = any "norm-referenced reliability coefficient"
based upon the classical test theory model,
$C$ = a mastery cutting score,
$\bar{X}$ = mean score over persons, and
$V(X)$ = variance over persons.

Much of the discussion of LIV has involved some degree of 
misunderstanding about the nature of the coefficient; 
therefore, let us list a few characteristics of LIV: 

(a) LIV involves a consideration of the expected
squared deviation of a person's score from $C$, as distinct
from the expected squared deviation of a person's score
from $\bar{X}$ (the latter being a definition of variance).
Therefore, LIV involves an "atypical" squared error loss
function.
(b) The range of LIV is [0,1], and LIV is, therefore,
similar to classical reliability coefficients, $r_{xx}$, in this
respect. LIV equals $r_{xx}$ when $C = \bar{X}$, and LIV > $r_{xx}$
when $C \ne \bar{X}$.

(c) LIV is identical to a classical reliability
coefficient when that coefficient is based upon two 
populations with means equally distant above and below C 
(Harris, 1972b). 

(d) The classical standard error of measurement is
the same for both LIV and $r_{xx}$; therefore, the (usually)
larger value of LIV does not imply a more dependable
estimate of a person's true score, in the classical sense,
nor does it imply a more dependable determination of
whether or not a true score falls above or below $C$
(Harris, 1972b). However, the usually larger value of
LIV does imply "a more dependable overall determination
of whether each true score falls above or below the
criterion level, when this decision is to be made for
every individual score in the distribution (Livingston,
1972a, p. 31)."

(e) In general, there is no algebraic transformation
of the observed test scores that produces a set of scores
such that, when these scores are used in a classical
reliability formula, the result equals LIV. One is
tempted to think that this might be true if one used the
deviation scores $X_i - C$, but, since these scores are
a linear transformation of the $X_i$ scores, the classical
reliability of the deviation scores equals the classical
reliability of the original $X_i$ scores.
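
Characteristic (b), and the ease with which the coefficient can be
raised by moving the cutting score, can be seen in a small numerical
sketch. The Python function below simply evaluates the formula for
LIV from summary statistics; the function name and the numerical
values are hypothetical.

```python
def livingston(r_xx, var_x, mean_x, cut):
    """LIV = (r_xx * V(X) + (mean - C)^2) / (V(X) + (mean - C)^2)."""
    squared_distance = (mean_x - cut) ** 2
    denominator = var_x + squared_distance
    if denominator == 0:
        raise ValueError("undefined when variance is zero and C equals the mean")
    return (r_xx * var_x + squared_distance) / denominator


# Hypothetical summary statistics: classical reliability 0.60,
# score variance 4.0, mean score 7.0.
for cut in (7.0, 8.0, 9.0, 10.0):
    print(f"C = {cut:4.1f}  LIV = {livingston(0.60, 4.0, 7.0, cut):.3f}")
```

With the cutting score at the mean, LIV equals the classical
coefficient; as the cutting score moves away from the mean, LIV
rises toward one although the test data are unchanged.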

In the opinion of this author, the net result of
these observations seems to be that LIV has some useful
descriptive properties, if one accepts that Livingston's
"atypical" squared error loss function is meaningful and
appropriate. (Hambleton and Novick, 1973, are two
researchers who seriously question the kind of squared
error loss used by Livingston.) However, it is clear that
LIV relies upon test score variance, and this characteristic
is a negative factor in the minds of many
researchers. Also, it is clear that LIV does not enhance
our ability to estimate a person's true score, even when
LIV is very much greater than its corresponding classical
reliability coefficient.

In short, this author sees no compelling reason for 
generally abandoning the use of LIV as some might suggest; 
however, this author also feels that LIV should not be 
considered as the answer to the question of measuring
the reliability of a criterion-referenced test. Also,
users of LIV should be very careful to interpret and 
use this coefficient correctly. It is extremely easy 
to increase the value of LIV by moving C farther away 
from the mean; however, this should be done if and only 
if there is a substantively defensible reason for doing 
so. Finally, the reader should note that LIV is really
a coefficient for mastery testing, not for criterion- 
referenced testing, in general. 

Ozenne's sensitivity indices. Ozenne (1971) claims
that in a criterion-referenced testing situation the
important question is, "How effective has instruction
been?" The rationale for his first sensitivity index lies
in "the implicit assumption that if there is a difference
in level of response on the two (testing) occasions, ...
such a difference is due to the intervening instruction
(Ozenne, 1971, p. 17)." In Ozenne's model the two
testing occasions under consideration are the pretest and
the posttest. More explicitly, the model under consideration
is:

$$X_{ijk} = \pi + \alpha_j + \beta_k + (\alpha\beta)_{jk} + e_{ijk}, \quad \text{where}$$

$\pi$ = population parameter;
$\alpha_j$ = effect due to persons, $j = 1, 2, \ldots, N$;

$\beta_k$ = effect due to occasions (i.e., effect due to
instruction), $k = 1, 2$;

$(\alpha\beta)_{jk}$ = effect due to interaction of examinees (persons)
and occasions factors; and

$e_{ijk}$ = error of measurement.

Using this model, Ozenne's first sensitivity index is:

$$\mathrm{SENS}_1 = \frac{MS_{\text{occasions}} - MS_{\text{interaction}}}{MS_{\text{occasions}} - MS_{\text{interaction}} + N \cdot MS_{\text{error}}}$$

This index is, in effect, the variance due to instructional
effects (the occasions effect) divided by the sum
of the variances due to instructional effects and errors
of measurement.
Ozenne's second index of sensitivity is given by
the formula:

$$\mathrm{SENS}_2 = \frac{MS_{\text{treatment}} - MS_{\text{subjects w. treatment}}}{MS_{\text{treatment}} - MS_{\text{subjects w. treatment}} + N \cdot MS_{\text{error}}}$$



This index is intended to be used when one has two
different treatment groups, one group receiving
instruction and the other not. The underlying statistical
model is:

$$X_{ijk} = \pi + \beta_k + \alpha_{j(k)} + e_{ijk}, \quad \text{where}$$

$\pi$ = population parameter;
$\beta_k$ = effect due to treatments, $k = 1, 2$;
$\alpha_{j(k)}$ = effect due to persons nested within
treatment $k$; and
$e_{ijk}$ = error of measurement.

Other suggested indices. In addition to the
Berger-Carver mastery agreement statistic, Carver (1970)
suggests that "the reliability of a single form of a
criterion-referenced device could be estimated by
administering it to two comparable groups. The percentage
that met the criterion in one group could be
compared to the percentage that met the criterion in the
other group (p. 56)."

Cox and Graham (1966) and Ferguson (1971) suggest
use of the coefficient of reproducibility for reliability
estimation when the criterion-referenced test items are
assumed to form a Guttman Scale.

Discussion. It seems appropriate to suggest some
statements, of a comparative nature, concerning the above
indices.

First, all of the above indices, except those
suggested by Ivens, Marshall, and Ozenne, are, more
precisely, indices for mastery tests, since these indices
depend, one way or the other, on the specification of a
mastery cutting score. For these mastery test reliability
indices, it is important to observe that the fundamental
or primary student score under consideration is often 
ambiguous. For example, is the fundamental score the 
number (or percentage) of items correct, or the extent 
to which this score is above or below the mastery score,
or merely whether or not this score is above or below 
the mastery score? Another way to view this issue is 
to ask the question, "What is the appropriate error of 
measurement?" Only Hambleton and Novick (1973) address 
this issue in any depth. In short, there is a 
considerable lack of test theoretic justification 
(classical or otherwise) for many of the suggested 
reliability indices for mastery tests and, for that 
matter, criterion-referenced tests, in general. 

Second, several of the above indices (Marshall's, 
Harris', Livingston's, and possibly Hambleton and Novick's)
depend, directly or indirectly, upon the variance of 
student scores. Many researchers feel that the variance 
of student scores should exert no, or minimal, influence 
upon judgments about a criterion-referenced test's 
reliability or validity. 

Third, since Marshall's, Harris', and Livingston's
indices involve only one administration of a test,
they cannot be considered measures of stability. For
the most part, these indices seem to be measures of
the extent to which the test is dependable in its
ability to classify subjects as masters or non-masters.
Therefore, in a sense, these measures are analogous to
what are usually called measures of internal consistency.
Also, at least Livingston's index can be interpreted
as a measure of equivalence.

Fourth, only Ozenne's $\mathrm{SENS}_1$ index incorporates both
pre- and posttest scores; therefore, this index
may appear to have the potential for assessing the
reliability of change scores, whereas the other indices
clearly do not have this potential. However, Ozenne's $\mathrm{SENS}_1$
index is primarily a measure of instructional effectiveness,
not a measure of the reliability of criterion-referenced
change scores. Also, it should be noted
that Ozenne's second index is not really a reliability
index either; Ozenne's second index is merely another
measure of instructional effectiveness. One could
certainly argue, therefore, that Ozenne's indices should
not even be discussed in this chapter, even though other
researchers have mistakenly considered Ozenne's indices
to be measures of reliability.
Marshall (1973) provides additional information
and insight into the characteristics and function of 
many of the above indices. 

In the opinion of this author: (a) Hambleton and 
Novick's indices have the most appealing theoretical 
rationale of those indices proposed for mastery tests, 
but one important statistical problem remains to be 
solved before these indices will be generally useful, 
(b) Livingston's index is mathematically similar to 
classical reliability indices, but it employs a 
questionable theoretical basis and is somewhat difficult
to interpret, (c) the indices attributable to
Harris and Marshall may have practical utility, but
this has not yet been demonstrated, and, in addition,
both of these indices may be questionable from a
theoretical point of view, (d) Ozenne's indices are
not really reliability indices, even though they have
been treated as such by some authors, and (e) Ivens'
indices, as well as the Berger-Carver index, are
appealing in several respects, and Ivens' indices are
the most appropriate available indices for a criterion-referenced
test when a mastery cutting score is not
employed, but none of these indices has yet received
sufficient critical examination by researchers and
practitioners. In short, many important issues 
surrounding the reliability of criterion-referenced 
measures remain unsolved problems, or, at best, these 
issues have not yet received adequately complete 
treatment in the literature. 
CHAPTER V

Criterion-Referenced Item Analysis
and Revision Procedures Employing
Classical Scoring

The differences between criterion-referenced 
and norm-referenced testing have led most researchers 
to conclude that norm-referenced item analysis proce- 
dures are of questionable value in criterion-referenced 
testing situations. (See, for example, Popham and Husek,
1969; Popham, 1971; Cox and Vargas, 1966; and Brennan,
1970.)

Yet, clearly, a criterion-referenced test can be
no better than the items it contains. Therefore, if
we are to develop reliable and valid criterion-referenced
tests, we need statistics to describe the
performance of students on items, we need statistics
for assessing item reliability and validity, and we
need procedures for identifying poor or undesirable
criterion-referenced test items. These topics are the
subject of this chapter. Specifically, in this chapter
we will consider: (a) item statistics for criterion-referenced
tests, (b) a procedure for identifying
criterion-referenced test items and instruction that
require revision, and (c) the use of item analysis
tables in criterion-referenced testing situations. In
practically all cases, in this chapter, the statistics
and procedures we discuss entail the use of the classical
correct/wrong scoring procedure. Other scoring procedures
are considered in Chapter VI.

The subject of item analysis and revision procedures 
for criterion-referenced tests is especially crucial 
and especially difficult. It is especially crucial in 
that the validity of a criterion-referenced test is very 
closely tied to the validity of the individual items. 
It is especially difficult in that: (a) there are few, 
if any, objective, empirically-based criteria for "good" 
criterion-referenced test items, and (b) even if such 
criteria did exist, empirical data can identify items 
that may require revision, but empirical data can seldom, 
if ever, dictate that an item must be revised or eliminated.
Thus, at least at the present time, any total
evaluation of a criterion-referenced test item necessi- 
tates a considerable amount of subjective judgment on the 
part of subject matter specialists. 
The statistics and procedures discussed below have
been culled from the literature or developed by the
author. Thus, they represent a statement of the state-of-the-art
in criterion-referenced item analysis and
revision procedures, basically from an empirical point
of view. However, it should be understood that there
is considerable discussion and even some disagreement
among researchers concerning the applicability of these 
statistics and procedures. Much work remains to be 
done. 

Item Statistics 

In this section we consider item statistics relevant
to criterion-referenced testing. Most of the
statistics discussed here are reported in the litera- 
ture; the others were developed by the author and 
are offered for consideration. The reader will note 
that we consider two kinds of statistics for items: 

(a) measures of state (i.e., measures that reflect 
student performance at one point in time) and 

(b) measures of change (i.e., measures that reflect 
student performance at two points in time) . Also, for 
both of these possibilities we consider statistics for
describing the reliability and validity of an item. 

Most of the literature that discusses criterion- 
referenced item statistics treats these statistics as 
measures of state; however, criterion-referenced tests 
are often used to assess change, especially as the issue 
of change relates to the effectiveness of an instruc- 
tional system (see Chapter I) . 

Measures of state. For the most part, in criterion-referenced
testing, measures of state are expressed as
difficulty levels or, less frequently, as error rates.
The difficulty of an item is defined as the proportion
of students who get an item correct. As such, the term
"difficulty level" is somewhat of a misnomer in that if
difficulty level is high then the item is easy, and if
difficulty level is low then the item is "difficult."
Since difficulty level is a proportion, its range is
zero to one.

Error rate is defined as the proportion of students 
who get an item incorrect; it is mathematically equal to 
one minus the difficulty level, and its range is also 
zero to one. Thus, error rate contains all of the
information that difficulty level contains, and error
rates do not suffer from the interpretation problem 
encountered with difficulty levels. 

Measures of change. It is not our intent here to
indulge in a lengthy discussion of the measurement of
change. We have previously discussed this issue to
some extent, and it will be a subject of further discussion
later. Here we merely want to identify major
references relating to the measurement of change in
criterion-referenced testing situations.

One of the earliest empirical studies using 
criterion-referenced test data was performed by Cox 
and Vargas (1966) . The index they considered was 
simply the difference between posttest difficulty level
and pretest difficulty level. Hambleton and Gorth (1971) 
and Popham (1971) have also examined this index, and, 
in addition, Popham (1971) has considered various other 
statistics that depend upon change scores. For the 
most part, these authors have treated the indices they 
analyzed in a manner similar to the way discrimination indices 
are treated in norm-referenced testing. That is, the
indices have been viewed primarily as statistics for 
identifying "bad" or "atypical" criterion-referenced 
test items. 
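
The measures of state defined earlier and the pretest-posttest
difference index can be sketched together in Python. The example
assumes 0/1 item scores for one item; the function names and data
are hypothetical.

```python
def difficulty(item_scores):
    """Difficulty level: proportion of students answering the item correctly."""
    return sum(item_scores) / len(item_scores)


def error_rate(item_scores):
    """Error rate: one minus the difficulty level."""
    return 1.0 - difficulty(item_scores)


def pre_post_difference(pre_scores, post_scores):
    """Posttest difficulty level minus pretest difficulty level."""
    return difficulty(post_scores) - difficulty(pre_scores)


# Hypothetical 0/1 scores for one item, ten students, before and after instruction.
pre = [0, 0, 1, 0, 0, 1, 0, 0, 0, 0]
post = [1, 1, 1, 0, 1, 1, 1, 1, 0, 1]

print("pretest difficulty: ", difficulty(pre))
print("posttest difficulty:", difficulty(post))
print("posttest error rate:", error_rate(post))
print("difference index:   ", pre_post_difference(pre, post))
```

A difference near zero, or a negative difference, would flag the
item (or the associated instruction) for closer inspection; the
index by itself cannot say which of the two is at fault.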



It should be pointed out that one could also argue 
that these indices are measures of instructional effec- 
tiveness. In fact, in criterion-referenced testing 
situations in instructional environments, Ivens (1970) 
and Brennan (1970) treat measures of change primarily 
as measures of instructional effectiveness, and only 
secondarily as indices for identifying poor criterion- 
referenced test items. The measure of item change 
proposed by Ivens (1970) has been introduced in Chapter 
IV, and will be discussed again below. Brennan (1970) 
has suggested the consideration of indices called 
"percentage of maximum possible gain" and "percentage 
of maximum possible effectiveness." 

Item reliability measures of state. For the
classical test theory model, the reliability of an item
(in an internal consistency or equivalence sense) is
usually calculated by determining the intraclass correlation
coefficient using Hoyt's (1941) analysis of variance
framework. This technique has been considerably extended
recently by the work of Cronbach et al. (1972). The
intraclass correlation coefficient may be an appropriate
measure for criterion-referenced item reliability (in
an equivalence sense) if: (a) the variance of the total 
scores over items is not close to zero, (b) all items 
are measures of the same objective, and (c) one accepts 
(and the data fulfil) the implicit assumptions entailed 
in using the intraclass correlation coefficient as a 
measure of criterion-referenced item reliability. 

In the following paragraphs we consider a number of 
indices that have been proposed specifically for the 
purpose of calculating item reliability in criterion- 
referenced situations. 

Ivens' (1970) measure, denoted $AI_j$ in Chapter IV,
provides us with a measure of item reliability, in
either an equivalence or stability sense, when all
subjects take both parallel items or when all subjects
are administered the same item twice, respectively.
Since this index is a reliability (R) index for a measure
of state (S), let us denote this index as RS.

Now suppose we have two items which are intended 
to be equivalent measures of a particular objective. 
It is not always feasible or desirable to have subjects 
take both items, yet we usually do want a measure of item 
reliability. Let us now consider a procedure, offered 
by this author, for obtaining item reliability, given 
two supposedly parallel items, when (a) all students are 
randomly assigned to one of two groups, and (b) students 
in group one respond to the "first" item and students in 
group two respond to the "second" item. 

Let us denote persons in group one as: 

$A_j$, $j = 1, 2, \ldots, n$

and persons in group two as:

$B_k$, $k = 1, 2, \ldots, n$.

Now, if we considered students $A_j$ and $B_k$ to be the same
persons when $j = k$, we could calculate the
index RS and have a measure of item equivalence. However,
since the order of the persons' subscript is arbitrary,
the resulting index is only one of a large number of
possibilities; for example, we could just as well have
considered $A_1$ and $B_2$, $A_2$ and $B_3$, and so on, to be the
same person. However, we can extend this rationale to
obtain what may be a reasonable index.
The procedure is as follows: (a) calculate RS for
each distinct way of pairing persons in group one with
persons in group two and (b) average the RS indices in
order to obtain RS*, which will denote the desired
measure of equivalence. This procedure can also be
applied when there are unequal numbers of subjects in
groups one and two, say

$n_1$ = number of subjects in group one, and

$n_2$ = number of subjects in group two, where

$n_2 \ge n_1$.

The procedure indicated above is, however, cumbersome
because there are

$$n_2 (n_2 - 1)(n_2 - 2) \cdots (n_2 - n_1 + 1)$$

different ways of pairing the $n_1$ subjects in group one
with subjects in group two.

Therefore, it is fortunate that the above procedure
is mathematically equivalent to calculating RS when we
treat

$$\begin{array}{cccc}
A_1 \text{ and } B_1 & A_2 \text{ and } B_1 & \cdots & A_{n_1} \text{ and } B_1 \\
A_1 \text{ and } B_2 & A_2 \text{ and } B_2 & \cdots & A_{n_1} \text{ and } B_2 \\
\vdots & \vdots & & \vdots \\
A_1 \text{ and } B_{n_2} & A_2 \text{ and } B_{n_2} & \cdots & A_{n_1} \text{ and } B_{n_2}
\end{array}$$

as $n_1 n_2$ different subjects. That is, we examine all
possible pairs of subjects, where a pair is defined as
one person from each group. If the item score for both
persons in a pair is the same, then that constitutes an
agreement, and RS* is the number of agreements divided
by the total number of pairs. Letting

$c_1$ = number of subjects in group one who got
"first" parallel item correct, and

$c_2$ = number of subjects in group two who got
"second" parallel item correct,

it can be shown that:

$$RS^* = \frac{c_1 c_2 + (n_1 - c_1)(n_2 - c_2)}{n_1 n_2}
       = p_1 p_2 + q_1 q_2
       = 1 - p_1 - p_2 + 2 p_1 p_2 ,$$

where $p_i = c_i / n_i$ and $q_i = (n_i - c_i)/n_i$ for
$i = 1, 2$.

The index RS* has a range of zero to one.
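
The equivalence claimed above can be checked numerically. The
following Python sketch is illustrative only: the function names
and the score lists are hypothetical, item scores are assumed to
be coded 0/1, and RS for a single pairing is taken to be the
proportion of paired subjects whose scores agree.

```python
from itertools import permutations, product


def rs_star_by_pairings(group_one, group_two):
    """Average RS over every way of pairing group one with distinct
    members of group two (requires len(group_one) <= len(group_two))."""
    total, count = 0.0, 0
    for partners in permutations(group_two, len(group_one)):
        agreements = sum(a == b for a, b in zip(group_one, partners))
        total += agreements / len(group_one)
        count += 1
    return total / count


def rs_star_by_all_pairs(group_one, group_two):
    """Agreements among all n1 * n2 cross-group pairs, divided by n1 * n2."""
    pairs = list(product(group_one, group_two))
    return sum(a == b for a, b in pairs) / len(pairs)


def rs_star_closed_form(group_one, group_two):
    """RS* = 1 - p1 - p2 + 2 * p1 * p2."""
    p1 = sum(group_one) / len(group_one)
    p2 = sum(group_two) / len(group_two)
    return 1 - p1 - p2 + 2 * p1 * p2


# Hypothetical 0/1 item scores for the two randomly assigned groups.
group_one = [1, 1, 1, 0]       # responses to the "first" item
group_two = [1, 1, 0, 0, 1]    # responses to the "second" item

print(rs_star_by_pairings(group_one, group_two))
print(rs_star_by_all_pairs(group_one, group_two))
print(rs_star_closed_form(group_one, group_two))
```

All three functions should agree up to floating-point error; only
the closed form is practical for groups of realistic size.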

Another measure of item reliability (strictly in
the sense of equivalence) is suggested by Sabers and
Kania (1972). Let us identify the two supposedly
parallel items as items j and j', and let us display
the data in the following form:

                         Item j'
                      Pass     Fail
    Item j   Pass       A        B
             Fail       C        D

where A = number of students who passed both items,
      B = number of invalid passes on item j,
      C = number of invalid passes on item j', and
      D = number of students who failed both items.

Sabers and Kania define the "index of item precision"
for items j and j', respectively, as:

$$P_j = 1 - B/N$$

and

$$P_{j'} = 1 - C/N ,$$

where N is the total number of subjects.
Using these two indices of item precision, Sabers and
Kania define the item reliability coefficient as:

$$RS^{**} = 0.5 (P_j + P_{j'}) (1 - |P_j - P_{j'}|) .$$

This coefficient (which they call the KI coefficient of
item equivalence) has a range of zero to one. Sabers
and Kania claim that the higher the value of RS** "the
greater the degree of agreement between the decisions
made by the two forms."
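
As a rough illustration, the two indices of item precision and
the coefficient RS** can be computed from the four cell counts.
This sketch assumes the coefficient has the form
0.5(P_j + P_j')(1 - |P_j - P_j'|), which is the reading consistent
with the stated range of zero to one; the function names and the
cell counts are hypothetical.

```python
def item_precision(a, b, c, d):
    """Return (P_j, P_j') from the cells of the 2 x 2 pass/fail table.

    a: passed both items          b: invalid passes on item j
    c: invalid passes on item j'  d: failed both items
    """
    n = a + b + c + d
    return 1 - b / n, 1 - c / n


def rs_double_star(a, b, c, d):
    """Item equivalence coefficient, ASSUMING the sum form
    0.5 * (P_j + P_j') * (1 - |P_j - P_j'|)."""
    p_j, p_j_prime = item_precision(a, b, c, d)
    return 0.5 * (p_j + p_j_prime) * (1 - abs(p_j - p_j_prime))


# Hypothetical counts for N = 50 students.
print(item_precision(30, 5, 10, 5))
print(rs_double_star(30, 5, 10, 5))
```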



The reader will note that all of the above techniques
are applicable only when we have two parallel items or
two administrations of the same item. A procedure
suggested by Brennan and Stolurow (1971) is applicable
for any number of parallel items. Brennan and Stolurow
note that, in classical test theory, the statistical
criteria for K parallel tests are that the K means, the
K variances, and the K(K-1)/2 intercorrelations be equal.
When one has K items, instead of K tests, the same
criteria would seem to be appropriate. If item scores
constitute a multivariate normal distribution, then the
above assumptions can be tested using a procedure
developed by Wilks (1946). However, in most criterion-
referenced testing situations one cannot justifiably
assume that item scores constitute a multivariate normal
distribution. In such cases, in order to test the
equality of the K means and the K variances one can
use Cochran's Q test (see Siegel, 1956); however, this
author knows of no appropriate statistical procedure
for testing the equality of the intercorrelations in
the absence of a multivariate normal distribution of
item scores. Therefore, in most criterion-referenced
testing situations, at least at the present time,
researchers will have to make subjective judgments
about the equality of item intercorrelations.
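
For readers who wish to compute Cochran's Q, a minimal sketch
follows. It uses the standard formula
Q = (K - 1)[K * sum(G_j^2) - (sum G_j)^2] / [K * sum(L_i) - sum(L_i^2)],
where G_j is the number of students passing item j and L_i is the
number of items passed by student i; Q is referred to a
chi-square distribution with K - 1 degrees of freedom. The
function name and the data matrix are hypothetical.

```python
def cochran_q(scores):
    """Cochran's Q for a students-by-items matrix of 0/1 item scores."""
    k = len(scores[0])                                  # number of items
    item_totals = [sum(row[j] for row in scores) for j in range(k)]
    student_totals = [sum(row) for row in scores]
    grand_total = sum(item_totals)
    numerator = (k - 1) * (k * sum(g * g for g in item_totals)
                           - grand_total ** 2)
    denominator = k * grand_total - sum(l * l for l in student_totals)
    if denominator == 0:
        # Every student passed all items or failed all items.
        raise ValueError("Q is undefined when no student's scores vary")
    return numerator / denominator


# Hypothetical data: six students, three supposedly parallel items.
scores = [
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 1],
    [1, 0, 0],
    [0, 0, 0],
    [1, 1, 0],
]
print(cochran_q(scores))   # compare with chi-square, K - 1 = 2 df
```

Because the variance of a 0/1 item is determined by its mean, a
test of equal item means is also a test of equal item variances.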

Item validity measures of state. The usual
measure of item validity is a discrimination index,
which compares item scores with scores on some criterion.
For most criterion-referenced tests, total test score is
usually the only available, appropriate criterion. A
number of correlational type discrimination indices have
been reported in the literature; however, for criterion-
referenced testing they have the disadvantage of being
severely affected by small amounts (or lack) of variance
in the distribution of item and/or test scores. (See
Brennan, 1970, for a more complete discussion of such
indices.) Therefore, this author recommends use of the
following index discussed in detail by Brennan (1972):

$$VS = (c_u / n_u) - (c_l / n_l) ,$$ where

$c_u$ = the number of students in the upper group
        who got the item correct,

$c_l$ = the number of students in the lower group
        who got the item correct,

$n_u$ = the total number of students in the upper
        group, and

$n_l$ = the total number of students in the lower
        group.

In Brennan (1972) the VS index is called the B index;
here, we have chosen the designation "VS" to indicate
that the index relates to item validity (V) for
measures of state (S). This index has a lower limit
of -1 and an upper limit of +1. For mastery testing,
upper and lower groups would usually be defined in terms
of the mastery cutting score. It may be appropriate,
in some cases, to eliminate from consideration students
close to the mastery cutting score, since such students
are on the borderline of mastery.
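
A minimal sketch of the VS computation is given below. It
assumes that students scoring at or above the mastery cutting
score form the upper group and all others the lower group; that
boundary convention, the function name, and the data are
illustrative assumptions.

```python
def vs_index(item_scores, total_scores, cutting_score):
    """VS = c_u/n_u - c_l/n_l for one item scored 0/1.

    Assumption: total score >= cutting_score places a student in the
    upper group; every other student is in the lower group.
    """
    upper = [x for x, t in zip(item_scores, total_scores) if t >= cutting_score]
    lower = [x for x, t in zip(item_scores, total_scores) if t < cutting_score]
    if not upper or not lower:
        raise ValueError("VS requires at least one student in each group")
    return sum(upper) / len(upper) - sum(lower) / len(lower)


# Hypothetical data: eight students, one item, cutting score of 7.
item_scores = [1, 1, 1, 0, 1, 0, 0, 1]
total_scores = [9, 8, 8, 7, 5, 4, 3, 6]
print(vs_index(item_scores, total_scores, cutting_score=7))
```

Students near the cutting score could be dropped from both lists
before calling the function if a borderline band is to be ignored.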

Both Popham and Husek (1969) and Brennan (1972)
agree that for criterion-referenced testing: (a) negatively
discriminating items are undesirable, (b) non-discriminating
items are not necessarily bad items,
and (c) positively discriminating items may indicate
ineffective instruction. Brennan (1972) also points out
that if all students get an item correct, then the VS
index equals zero. Therefore, if it is desirable that
all students get an item correct, then the "ideal" value
of VS is zero, and, hence, the ideal item is a non-discriminating
item. Following this line of reasoning,
even positively discriminating items (and certainly
negatively discriminating items) indicate that either the
test item or instruction may require revision.

Item reliability measures of change. In order
to address this issue, let us define the following:

$X_{i1}$ = pretest response (0,1) of person i to the
           first administration of an item (or to the
           first of two parallel items),

$X_{i2}$ = pretest response (0,1) of person i to the
           second administration of an item (or to the
           second of two parallel items),

$Y_{i1}$ = posttest response (0,1) of person i to the
           first administration of an item (or to the
           first of two parallel items),

$Y_{i2}$ = posttest response (0,1) of person i to the
           second administration of an item (or to the
           second of two parallel items),

$D_{i1} = Y_{i1} - X_{i1}$ = 0, 1, or -1, and

$D_{i2} = Y_{i2} - X_{i2}$ = 0, 1, or -1.

Now, the reliability of an item as a measure of
change can be expressed as the number of subjects for
whom $D_{i1} = D_{i2}$, divided by the total number of subjects.
To be consistent with our designation for other
indices, let us denote this index of item reliability (R)
for a measure of change (C) as RC; the reader will note
that the range of RC is from zero to one.

It is important to notice that, for the RC index, if

$D_{i1} = D_{i2} = 0$ for all subjects i,

then RC = 1; i.e., the change score reliability of the
item is perfect even though, for every student, no
change has occurred. This is not a contradiction.
The fact that RC = 1 merely indicates that the item is
perfectly reliable when used as a measure of change;
this fact does not say anything about the amount of
change or the direction of change.
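
The RC index can be sketched as follows, reading the agreement
condition as equality of the two change scores for the same
person. The function name and the response vectors are
hypothetical.

```python
def rc_index(x1, y1, x2, y2):
    """Proportion of subjects whose two item change scores agree.

    x1, y1: pretest and posttest 0/1 responses, first administration
    x2, y2: pretest and posttest 0/1 responses, second administration
    """
    d1 = [y - x for x, y in zip(x1, y1)]    # change scores, each -1, 0, or 1
    d2 = [y - x for x, y in zip(x2, y2)]
    agreements = sum(a == b for a, b in zip(d1, d2))
    return agreements / len(d1)


# Hypothetical responses for four subjects.
x1, y1 = [0, 0, 1, 0], [1, 1, 1, 0]
x2, y2 = [0, 1, 1, 0], [1, 1, 0, 0]
print(rc_index(x1, y1, x2, y2))

# No subject changes on either administration, yet RC is still 1.
print(rc_index([1, 0], [1, 0], [1, 0], [1, 0]))
```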

Item validity measure of change. In order to
construct such an index, we must have some criterion
for change. One possible criterion (although not
necessarily a good one) is the set of student scores,
each of which is an average item change score, where an
item change score is defined as posttest item score minus
pretest item score. Using the notation in the previous
section and replacing the second subscript by an item
subscript j, a person's average change score is:

$$D_{i\cdot} = (1/K) \sum_{j=1}^{K} (Y_{ij} - X_{ij}) ,$$ where

$j = 1, 2, \ldots, K$ items, and

$$-1 \leq D_{i\cdot} \leq 1 .$$

Using the above scores (or some other change
score criterion if available and appropriate) one can
define upper and lower groups. Then one can compare
the trichotomized item change scores with the dichotomized
criterion change scores using the following table:

                       Item Change Score
                    -1         0         1
    Upper Group   P(U,-1)    P(U,0)    P(U,1)
    Lower Group   P(L,-1)    P(L,0)    P(L,1)

In this table P(U,-1) means the proportion of students
in the upper group who got an item change score of -1;
the other cells are interpreted in a similar manner.
Using the above table one can examine the validity of
the item as a measure of change; however, no single
statistic with a range of -1 to +1 appears to be
readily available from this table for the purpose of
assessing the extent to which the item is a valid
measure of change. Of course, one could obtain a single
statistic merely by correlating item and criterion
change scores, but the appropriateness of such a procedure
needs to be examined for criterion-referenced
testing situations.
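
The table of proportions can be built directly from the change
scores. In the sketch below the criterion is each student's
average change score, and students at or above a cut value chosen
by the evaluator form the upper group; the cut value, the
function name, and the data are illustrative assumptions.

```python
def change_table(item_change, criterion_change, cut):
    """Return {(group, score): proportion} for groups 'U' and 'L' and
    item change scores -1, 0, and 1.

    Assumption: criterion change >= cut defines the upper group.
    """
    groups = {
        "U": [c for c, d in zip(item_change, criterion_change) if d >= cut],
        "L": [c for c, d in zip(item_change, criterion_change) if d < cut],
    }
    table = {}
    for label, members in groups.items():
        if not members:
            raise ValueError("group " + label + " is empty")
        for score in (-1, 0, 1):
            table[(label, score)] = members.count(score) / len(members)
    return table


# Hypothetical data for six students.
item_change = [1, 1, 0, -1, 0, 1]
criterion_change = [0.6, 0.5, 0.4, 0.1, 0.0, 0.2]
table = change_table(item_change, criterion_change, cut=0.4)
for key in sorted(table):
    print(key, round(table[key], 3))
```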

Other possible indices for assessing item validity
as a measure of change include Jenkins' (1956) triserial
correlation coefficient and Saupe's (1966) change index.
The latter is defined as:

$$\mathrm{Corr}(Y_{ij} - X_{ij},\; Y_{i\cdot} - X_{i\cdot}) ,$$ where

$Y_{ij}$ = posttest score for person i on item j,

$X_{ij}$ = pretest score for person i on item j,

$Y_{i\cdot}$ = total number of items correct on the posttest
               for person i, and

$X_{i\cdot}$ = total number of items correct on the pretest
               for person i.
Also, Ivens (1970) suggests two indices that might
be used to assess item validity as a measure of change;
however, Ivens' indices involve three sources of information:
pretest, posttest, and retest (or retention
test).

It should be noted that the interpretation of any
of the above indices is, of course, confounded by the
presence of intervening instruction between pretest and
posttest. Therefore, if the item does not appear to be
valid when used to measure change, the problem may lie
with the item, the instruction, or both.

A Decision Process for Identifying Criterion-Referenced
Items and Instruction that Require Revision

In order to put the proposed decision process
into a conceptual context, let us assume that we have
an instructional program teaching a set of terminal
objectives. Chronologically, each terminal objective
is tested by a pretest item that occurs before the
objective has been taught and a posttest item that
occurs "some time after" the objective has been
taught. Furthermore, we will assume that all of the
items testing any objective are identical or equivalent.

In the final analysis, using item performance
data, we want to identify those test items and sections
of instruction (relevant to a given objective) that
require revision. The decision process we propose will
not necessarily tell the evaluator how to revise items
and/or instruction, but the process will provide objective
rules for deciding what to revise. (A previous version
of the process proposed here is provided by Brennan
and Stolurow, 1971.)

Types of data and decision. Most of the decision
rules discussed below make use of error rates and
discrimination indices. An observed error rate for an
item is the proportion of subjects who get the item
incorrect; therefore, error rate is equal to one minus
difficulty level. There are a number of discrimination
indices that have been proposed in the literature;
however, the applicability of many of them in criterion-
referenced testing situations is open to question.
Therefore, in general, we suggest using Brennan's (1972)
B discrimination index (designated as VS in
the previous section of this chapter).

For many of the proposed decision rules we will
assume that error rates are classified as either
high (H) or low (L), and that the evaluator predetermines
an appropriate cut-off point between high and
low error rate. For any given objective, the cut-offs
for the error rates discussed below must be identical
in order to apply the rules that will be specified.
Also, in most cases, the cut-offs chosen will probably
be the same for all objectives; however, occasions can
arise when certain objectives should have a higher
(or lower) error rate cut-off than other objectives.
For example, items testing very crucial objectives
might be assigned a cut-off of 0.10, while other items
might have a cut-off of 0.25.

Discrimination indices will be classified as
either positive (+), negative (-), or non-discriminating
(0). By positive and negative indices we mean indices
that discriminate significantly (at some appropriate
α level) in the positive and negative directions, respectively.

Before instruction we can obtain two kinds of data
for each objective that has a pretest item:

(a) the Theoretical Error Rate (TER), which is the
expected proportion of students getting a pretest item
incorrect simply on the basis of random guessing;
i.e., if "a" is the number of possible answers to an item,
then

$$TER = (a - 1)/a .$$

For example, if an item has five alternatives, we would
expect 80 percent of the students to get the item
incorrect simply by guessing randomly. Items that have
a virtual infinitude of possible answers have TER = 1;
however, the evaluator should be careful not to assume
that every free-response or open-ended test item has
TER = 1. Very often such items are so worded that only
two or three answers are possible, in which case TER =
0.50 or TER = 0.67.

(b) the Base Error Rate (BER), which is the observed
proportion of students getting a pretest item incorrect.

After instruction we can obtain two types of data
for each objective that has a posttest item: (a) the
Posttest Error Rate (PER) and (b) the Posttest Discrimination
Index (PDI).

In subsequent sections we will analyze the decisions
that can be made on the basis of the above data. Then
we will discuss the decisions that can be made based
upon the arithmetic differences between various error
rates.

For each decision rule presented we will give our
reasons for specifying whether test items or instruction
relevant to a given objective should be revised (R),
questioned (?), or not revised (NR). These decisions
should not, however, be interpreted too strictly; the
evaluator will still have to use some degree of
subjective judgment. For example, when we say, in
subsequent discussions, that an item should be revised
(R), we mean that our best guess on the basis of the
data is that the item should be revised, but the evaluator
must make the final decision. Also, when we say
that an item (or instruction) is questionable (?), we
mean that the data are not sufficient to make a definite
judgment about whether or not the item (or instruction)
should be revised.

One additional consideration deserves mention.
Ideally, one would validate test items prior to
using them in an instructional system; however, this is
often not feasible, especially when criterion-referenced
tests are used in an instructional system. Therefore,
in most cases, evaluation must take into account the
possible invalidity of both test items and instruction.
For this reason, most of the decision rules that will
be presented are based upon the assumption that we have
no a priori reason to believe that test items are more
valid than instruction or vice versa.

Pretest data. It is not likely that only pretest
data would be used to make decisions about test items,
yet it is useful to consider the types of decisions
that are appropriate on the basis of such data.

Rule 1: If TER and BER are both the same 
(i.e., H,H or L,L) then no necessity for revision is 
indicated. In this case, the observed error rate (BER), 
which is not affected by instruction, is approximately 
the same as the expected error rate (TER) . 

Rule 2: If TER is low (L) and BER is high (H),
there is no indication that revision is required. This
rather anomalous case could arise if the particular
objective for the item involved concepts that are typically
misunderstood. For example, many students (in
the author's opinion) believe that "inflammable" and
"flammable" have different meanings. If an item were
constructed testing whether or not "inflammable" and
"flammable" have the same meaning, and if this item were
given prior to instruction, it is quite possible that
more students would get the item incorrect than we would
expect on the basis of the theoretical error rate (TER).
In this case, there is no reason to revise the item;
rather, we expect that the instruction will correct the
students' misinformation.

Rule 3: If TER is high (H) and BER is low
(L), then the item will probably need to be revised.
In this case, students, without benefit of instruction,
are performing considerably better than expected.
It appears that the item itself may be teaching or
that one or more distractors are so easy that many
students can pick the correct answer largely by a
process of elimination. It is also possible that the
item is not at fault and the objective, while being
easy for most of the students, is considered to be an
integral part of the total set of objectives. In this
case, of course, the item would not be revised.
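
Rules 1 through 3 amount to a small lookup on the high/low
classifications of TER and BER. The sketch below is illustrative
only; the function names are hypothetical, and treating a rate
exactly equal to the cut-off as low is an assumption.

```python
def classify_error_rate(rate, cutoff):
    """Classify an error rate as high ('H') or low ('L').

    Assumption: a rate exactly equal to the cut-off counts as low.
    """
    return "H" if rate > cutoff else "L"


def pretest_decision(ter, ber, cutoff):
    """Return (rule number, item decision) for Rules 1-3."""
    t = classify_error_rate(ter, cutoff)
    b = classify_error_rate(ber, cutoff)
    if t == b:
        return 1, "NR"    # Rule 1: H,H or L,L
    if t == "L":
        return 2, "NR"    # Rule 2: TER low, BER high
    return 3, "R"         # Rule 3: TER high, BER low


# Five-alternative item (TER = 0.80) that few students miss before instruction.
print(pretest_decision(ter=0.80, ber=0.10, cutoff=0.25))
```

The returned decision is only a best guess from the data; the
evaluator still makes the final judgment.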

These rules, as well as all other rules that will
be discussed, are given in abbreviated form in Table 5-1.

Posttest data. As a result of administering a
posttest two types of data can be collected: the
Posttest Error Rate (PER) and Posttest Discrimination
Index (PDI). Since these data are collected after instruction,
theoretically decisions can be made about either
test items or instruction or both. However, from a
practical point of view, if revision seems to be required,
it is difficult to specify with any confidence that
the fault lies solely with the test item or solely with
instruction. In short, based upon posttest data, we
can usually say whether or not something is wrong, but
given only the posttest data it is difficult to pinpoint
the problem.

Rule 4: If PER = L and PDI = 0, then neither
the item nor the instruction needs to be revised. This
is the best possible situation, since the optimal conditions
for both error rate and discrimination index are
fulfilled; i.e., at the end of instruction we hope that
                        TABLE 5-1
                Rules for Decision-Making

    Rule        Error Rates                 Decision (a)
    No.     TER    BER    PER    PDI      Item   Instruction
    ---------------------------------------------------------
     1       H      H                      NR        --
             L      L                      NR        --
     2       L      H                      NR        --
     3       H      L                      R         --
     4                     L      0        NR        NR
     5                     L      +        ?         ?
                           L      -        ?         ?
     6                     H      -        R         R
     7                     H      +        ?         R
                           H      0        ?         R
     8      DER > 0 (b)                    R         --
            DER ≤ 0 (c)                    NR        --
     9      PMPG < c (d)                   --        R
            PMPG ≥ c (d)                   --        NR

(a) "NR" means no revision required; "R" means revision
is required; "?" means the data are not sufficient to make
a sound judgment about whether or not revision is required.

(b) DER is significantly greater than zero for a one-
tailed test of significance.

(c) DER is not significantly greater than zero for a
one-tailed test of significance.

(d) c is a cut-off chosen by the evaluator.

most of the students get the posttest item correct
(PER = L), and that the item is non-discriminating (PDI
= 0).

Rule 5: If PER = L and PDI = + or -, then
both the item and the instruction are questionable.
The fact that PDI is clearly non-zero indicates a possible
need for revision.

Rule 6: If PER = H and PDI = -, then both the
item and instruction should be revised, since PER = H and
PDI = - is the worst possible situation that can occur.
It is possible that either the item or the instruction
is at fault, but not both; however, we assume here that
the most universally applicable decision is to check
both the item and the instruction to see what revisions
are needed.

Rule 7: If PER = H and PDI = + or 0, then the
instruction should be revised and the item should be
questioned. Whenever error rate is high after instruction,
something is wrong, but without additional
information we do not know whether the fault definitely
lies with the item or the instruction. However, the
author believes that evaluators are apt to be more confident
about test items than they are about instruction;
it is also possible that the test items have been previously
validated or partially validated. Therefore,
in this case, it seems reasonable to place a less stringent
decision on the item than on the instruction. It
should be noted, however, that perceptions can be biased;
i.e., the test item could be at fault. It is certainly
advisable to analyze the nature of any validation or
prevalidation activity for its applicability in the present
context since sampling, testing, and teaching conditions
can vary considerably.
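
Rules 4 through 7 can be summarized as a lookup keyed on the
classified posttest data. The following sketch is illustrative;
the function name and the use of the strings "+", "-", and "0"
for the PDI classes are assumptions, and the PDI is taken to have
been classified already by an appropriate significance test.

```python
# (PER class, PDI class) -> (rule number, item decision, instruction decision)
POSTTEST_RULES = {
    ("L", "0"): (4, "NR", "NR"),
    ("L", "+"): (5, "?", "?"),
    ("L", "-"): (5, "?", "?"),
    ("H", "-"): (6, "R", "R"),
    ("H", "+"): (7, "?", "R"),
    ("H", "0"): (7, "?", "R"),
}


def posttest_decision(per_class, pdi_class):
    """Look up Rules 4-7 from the classified PER ('H' or 'L') and the
    classified PDI ('+', '-', or '0')."""
    return POSTTEST_RULES[(per_class, pdi_class)]


print(posttest_decision("L", "0"))   # best case: nothing to revise
print(posttest_decision("H", "+"))   # revise instruction, question the item
```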

Decisions based upon differences between error rates.
Most of the foregoing decision rules are dependent upon
the evaluator's choice of a cut-off between high and low
error rate. Dichotomizing error rate in this way clearly
facilitates the identification of appropriate decision
rules, and, in many cases, the simplicity of the technique
will probably outweigh any loss of precision. However,
we can also specify an additional pair of decision rules
that take into account quantitative differences between
error rates. One of these rules increases the
precision of previous decisions; the other provides
essentially new information. We will call these error
rates "derived" error rates to distinguish them from
the "raw" error rates discussed in the previous sections.

Let us consider two limitations of the high/low 
classification procedure for error rates. Suppose 
that Theoretical Error Rate (TER) and Base Error Rate 
(BER) for a given objective are both classified as 
high (H) , while the Posttest Error Rate (PER) is clas- 
sified as low (L) . Clearly, any actual arithmetic 
differences between TER and BER will not affect the 
decisions we have thus far proposed. Also, since BER 
and PER are merely classified as high and low, 
respectively, we will not have a quantitative measure of 
how much learning has actually taken place. 

Rules 1-3 are useful for making decisions based 
upon categorical differences between BER and TER, but 
we can make more accurate decisions by actually computing 
the difference between these error rates. Let 

DER = TER - BER,

where DER stands for "Difference Error Rate." If DER = 0, 
then the observed error rate on the pretest (BER) is 
identical to the expected error rate on the pretest 
(TER). If DER < 0, then fewer students are getting the 
item correct than we would expect on the basis of random 
guessing. Finally, if DER > 0, then more students are 
getting the item correct than we would expect on the basis 
of random guessing. As discussed previously, the last 
possibility is often an unfavorable situation, since it 
can mean that the item somehow "gives away" the correct 
answer. 

We can test the significance of a positive
difference between BER and TER by computing

$$Z = \frac{DER - 1/(2N)}{\sqrt{TER(1 - TER)/N}} ,$$

where N is the total number of students in the sample.
The term -1/(2N) is a correction for discontinuity and, as
such, can be dropped if the sample size is large. Note
that when TER = 1, Z is undefined; in this case, any value
of DER > 0 can be considered significant. Again, however,
one should be careful not to assume that TER = 1 just because
the format of the item is free-response. Once Z is
calculated, its significance can be tested by comparing
the value of Z with the normal curve standard score at
an appropriate α-level for a one-tailed test. Note that
we are only interested in positive values of DER.
We can now specify a more precise version of
Rules 1-3.

Rule 8: If the value of DER is significantly
greater than zero, then the item should be revised.
In all other cases no revision is required.
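
The significance test and Rule 8 can be sketched together. The
function names below are hypothetical, and the default critical
value of 1.645 (a one-tailed test at the .05 level) is only an
example; the evaluator chooses the α-level.

```python
import math


def der_z(ter, ber, n, continuity=True):
    """Z = (DER - 1/(2N)) / sqrt(TER(1 - TER)/N), with DER = TER - BER."""
    if not 0.0 < ter < 1.0:
        raise ValueError("Z is undefined unless 0 < TER < 1")
    der = ter - ber
    correction = 1.0 / (2 * n) if continuity else 0.0
    return (der - correction) / math.sqrt(ter * (1.0 - ter) / n)


def rule_8(ter, ber, n, z_critical=1.645):
    """Return 'R' if DER is significantly greater than zero, else 'NR'."""
    if ter == 1.0:
        # Z is undefined here; any DER > 0 is treated as significant.
        return "R" if ter - ber > 0 else "NR"
    return "R" if der_z(ter, ber, n) > z_critical else "NR"


# Five-alternative item (TER = 0.80), observed BER = 0.60, N = 100.
z = der_z(0.80, 0.60, 100)
p_one_tailed = 0.5 * math.erfc(z / math.sqrt(2))   # upper-tail normal area
print(round(z, 3), p_one_tailed, rule_8(0.80, 0.60, 100))
```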

None of the decisions discussed up to this
point has made use of any measure of gain in knowledge
relevant to a given objective that results from the
instructional system. It is probably true that gain
is not as important as final performance on the posttest,
in most instructional systems; however, if students
experience relatively little gain as a result of
experiencing instruction, one can legitimately question
the value of the instructional system itself. Thus,
measures of gain have long been a subject of considerable
interest in the field of instruction.

A simple measure of gain for an objective is the
difference between pretest error rate (BER) and posttest
error rate (PER). This measure has been suggested by
Cox and Vargas (1966); however, it has one serious
limitation: gains of the same magnitude do not mean
the same thing. Consider a gain of 0.50 resulting from
BER = 1.00 and PER = 0.50 and a gain of the same magnitude
resulting from BER = 0.50 and PER = 0.00. In the former
case, the instructional system has failed to produce
50 percent of the gain in performance that could be
achieved, while in the latter case, the instructional
system has produced as much gain as possible given the
entry level of the students. Thus, in the former case,
some revision of the instruction may be desirable, while
in the latter case, no revision in the instructional system
is required on the basis of these data.

The above, rather trivial, example illustrates that
simple gain does not provide a very meaningful basis for
revising instruction. A better measure is percent of
maximum possible gain for an objective, defined as:

$$PMPG = \frac{BER - PER}{BER} .$$

In order to make use of this measure the evaluator must
specify a cut-off that determines whether or not a given
value of PMPG indicates a need for revision.
Rule 9: If PMPG < c, where c is a cut-off specified
by the evaluator, then the instruction should be
revised. The cut-off c need not be the same for all
objectives. If PMPG ≥ c, then, on the basis of PMPG,
there is no indication that instruction needs to be
revised.
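
A short sketch of PMPG and Rule 9 follows, using the two cases
from the example above. The function names and the cut-off of
0.75 are hypothetical; note that PMPG is undefined when BER = 0,
since no gain is then possible.

```python
def pmpg(ber, per):
    """Percent of maximum possible gain: (BER - PER) / BER."""
    if ber == 0:
        raise ValueError("PMPG is undefined when BER = 0")
    return (ber - per) / ber


def rule_9(ber, per, c):
    """Return 'R' (revise instruction) if PMPG < c, otherwise 'NR'."""
    return "R" if pmpg(ber, per) < c else "NR"


# Two simple gains of 0.50 that mean quite different things.
print(pmpg(1.00, 0.50), rule_9(1.00, 0.50, c=0.75))   # half of possible gain
print(pmpg(0.50, 0.00), rule_9(0.50, 0.00, c=0.75))   # all of possible gain
```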

The literature contains many in-depth discussions
concerning the problems and pitfalls associated with
measures of gain. See, for example, Cronbach and Furby
(1970), DuBois (1962), and Harris (1963). Most of this
literature, however, treats measures of gain in the
context of their use in inferential statistics or
correlational analysis. While we appreciate the importance
of these issues, we hasten to add that measures of
gain, merely as descriptive statistics, can provide
useful information to evaluators. We believe that the
use of PMPG, as data for evaluation purposes, is a
case in point. Also, since, in criterion-referenced
testing, we assume an absolute measurement scale, many
of the objections to measures of gain are less crucial.

Use of Item Analysis Tables

An item analysis table indicates the number or 
percent of students who chose each of the alternatives 
of a test item. Further, in most cases, the students 
who responded to the item are partitioned into groups, 
based upon their performance on the total test. Thus, 
for example, if each student is put into either a 
"lower" or an "upper" group, then one can identify the 
number (or percent) of students in the lower and/or 
upper group who chose each alternative. Such tables, 
and their use in norm-referenced testing situations, are 
treated in practically every introductory text in 
educational measurement . 

Item analysis tables can also be quite useful in 
criterion-referenced testing situations. Let us now 
consider some of the issues involved in constructing and 
using such tables. 

(a) The classification of students into groups
should be meaningful for the criterion-referenced test.
For norm-referenced tests typical ways of partitioning
students include, for example, placing the top 50 percent
of the students in the upper group and the bottom 50
percent in the lower group, or partitioning students into
lower, middle, and upper thirds. These procedures are
not appropriate for criterion-referenced tests, because
the group into which a student is classified can be
determined only by reference to the scores of other
students. For criterion-referenced testing, the group
into which a student is classified should be uniquely
determined by the student's test score, independent
of the scores of other students. In mastery testing
this usually means that students who exceed the mastery
cutting score are defined as the upper group of students,
and all other students constitute the lower group.
Thus, for criterion-referenced item analysis tables,
groups are defined according to ranges of criterion-
referenced or mastery test scores. In many cases, only
two groups (upper and lower) are used; however, item
analysis tables often provide more useful and interpretable
information if one incorporates a "middle" group
that contains students whose test score is on the borderline
of mastery, acceptable behavior, or criterion
performance.

(b) In interpreting criterion-referenced item
analysis tables one should remember that if all students
get all items correct, then all cells but one in every
item analysis table will be empty. Furthermore, if all
students get an item correct, then the only cells that
will be non-empty are those associated with the correct
alternative. These observations may appear trivial; however,
they do emphasize an important consideration:
in criterion-referenced testing, the fact that few, or
no, persons choose an incorrect alternative (distractor)
does not necessarily indicate that the alternative
should be revised.

(c) For the sake of discussion let D(U,L) be the
difference between the proportions of students in the
upper and lower groups who choose a distractor, D.
One would usually expect D(U,L) to be equal to or less
than zero; therefore, if D(U,L) is very much greater
than zero, the distractor or the item itself may require
revision. Analogous statements referring to the correct
alternative are contained in the above discussion
concerning item validity and the B discrimination index.

(d) The process of analyzing criterion-referenced
test items that require revision can be conceived as a
two-stage process. The first stage entails the use of
decision rules such as those discussed in the previous
section of this chapter; the second stage entails a
detailed consideration of the item analysis table(s).
(It is sometimes useful to study the item analysis
tables for both the pretest and posttest administrations
of the item.) Unfortunately, at least at the present
time, the use of item analysis tables is probably more an
art than it is a science. Nevertheless, careful subjective
analysis of item analysis tables will often reveal the
presence of problems that are not apt to be evident from
the typical kinds of descriptive statistics for items.
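
As a rough illustration of point (c), the sketch below computes
D(U,L) for each distractor from lists of the alternatives chosen
by upper- and lower-group students. The function name and the
response data are hypothetical and serve only to show the
arithmetic; they are not taken from this report.

```python
def distractor_difference(upper_choices, lower_choices, distractor):
    """Return D(U,L): the proportion of the upper group choosing the
    distractor minus the proportion of the lower group choosing it."""
    p_upper = upper_choices.count(distractor) / len(upper_choices)
    p_lower = lower_choices.count(distractor) / len(lower_choices)
    return p_upper - p_lower


# Hypothetical responses; "a" is assumed to be the correct alternative.
upper = ["a", "a", "a", "b", "a", "a", "c", "a", "a", "a"]
lower = ["a", "b", "b", "c", "a", "d", "b", "a", "c", "b"]

for alt in "bcd":
    d = distractor_difference(upper, lower, alt)
    # A value well above zero would flag the distractor for review.
    print(alt, round(d, 2))
```

With these made-up responses every distractor has D(U,L) at or
below zero, which is the pattern one would usually expect.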
CHAPTER VI



An Alternative to the Classical
Administration and Scoring Procedure
For Analyzing Criterion-Referenced Test Items

In Chapter V we considered, in some detail, procedures
for analyzing criterion-referenced test items when
students are forced to pick one and only one alternative
and are scored either correct (1) or incorrect (0). With
very few exceptions, researchers in the field of
criterion-referenced testing have concerned themselves
only with this classical procedure for the administration
and scoring of items.

For norm-referenced testing, classical correct/incorrect
administration and scoring procedures seem to
be reasonably effective and useful. However,
norm-referenced tests are usually relatively long; the scores
from such tests are often normally distributed; floor
and ceiling effects seldom occur in norm-referenced tests;
and, most importantly, one is not very much concerned
about the precise proportion of items a student can
answer correctly; rather, one is concerned about the
ability of the test to distinguish among subjects. Each
of these characteristics of a norm-referenced test argues
directly or indirectly that the classical correct/incorrect
procedure is reasonably adequate (or, at least, not
grossly inadequate) for many norm-referenced tests.

On the other hand, criterion-referenced tests are
usually short; the scores from such tests are often
negatively skewed, even severely so; ceiling effects
are very common; and, most importantly, one is
fundamentally concerned about accurately estimating the
proportion of items to which a student knows the answer
(or possibly some other score). This emphasis on accurate
estimation of a student's score is especially critical in
criterion-referenced testing because there is seldom any
external criterion measure for judging validity.

Thus, in criterion-referenced testing it is very
important to use every possible means of eliminating
random (and systematic) errors of measurement. In
particular, it seems to this author that it is important
to eliminate (or, at least, be able to estimate the effect
of) guessing. Now, it is very clear that a considerable
amount of student guessing frequently occurs when a
student is forced to pick one and only one alternative
and the classical correct/incorrect scoring procedure is
used; moreover, when the classical procedure is used, it
is very difficult, if not impossible, to ascertain the
magnitude of the effect of guessing upon student scores.

Furthermore, since criterion-referenced tests are
frequently short, it seems desirable to obtain as much
information as possible from each item; yet, using the
classical procedure for administering and scoring an
item, one merely knows whether or not the student got
the item correct. In particular, using the classical
procedure one does not obtain information with regard to
the relative attractiveness of each alternative for each
student. This kind of information can be very useful
in determining whether or not to revise a
criterion-referenced test item. Thus, the classical procedure
somewhat limits the amount of information we obtain with
regard to any given criterion-referenced test item.

In short, from a criterion-referenced testing
viewpoint, this author feels that the classical procedure for
administering and scoring an item has two serious
limitations: (a) scores obtained using this procedure
incorporate an indeterminable amount of guessing and (b) this
procedure provides very little information with regard to
any given item, especially when relatively small numbers
of students take the item. These points imply that when
we use the classical procedure for criterion-referenced
testing, we may have less than adequate information for
determining whether or not a criterion-referenced test
item requires revision.

Therefore, it is worthwhile to consider alternatives
to the classical procedure. There are a number of points
of view from which one could consider different procedures.
Here we are interested in the ability of the procedure
to aid us in item analysis. That is, our goal is to
identify a procedure for administering an item that
provides us with optimum data for determining whether or
not the item needs to be revised; and, if possible, these
data should aid us in pinpointing the nature of any
difficulties with the item. For this purpose, we consider
two potential procedures which we call the "elimination
procedure" and the "confidence procedure." We find that
the confidence procedure is the better of the two for
our purposes.

It should be noted that here we are not concerned
about the kinds of scores typically obtained from the
elimination and confidence procedures; rather, our
primary concern is with the nature and amount of data
collected when such procedures are used. Also, we do
not assume that once an item is administered using one
procedure it will always be administered using that
procedure. In fact, when we consider the confidence
procedure, the manner in which we interpret the data
provides us with a kind of guessing-free estimate of
a person's classical score. Thus, once an item has
been validated using the confidence procedure, one can
administer the item using the classical procedure.

Two Alternatives to the Classical Procedure for
Administering Items

Elimination procedure. Coombs et al. (1956) suggest
a procedure for administering and scoring a test based
upon having students eliminate alternatives that they
consider to be incorrect. Since a student may eliminate
any number of alternatives for any test item, the
elimination procedure provides some information about
the relative attractiveness of each alternative.
However, the information provided is somewhat ambiguous
in that, for example, if a student eliminates two
alternatives, we do not know whether or not the student feels
more uncertain about one alternative than the other.

Also, let us consider the elimination procedure from
another point of view. As indicated previously, we are
interested in a procedure's ability to provide us with
a kind of guessing-free estimate of a person's classical
score. Let us call such an estimate a PC1 score,
indicating the probability (P) that a person's classical
(C) score on an item is unity (1). If we know, for
example, that a person guessed randomly on a
four-alternative item, then PC1 should be 0.25. The question
is, "Can the kind of data collected using the elimination
procedure provide us with an adequate basis for estimating
a student's PC1 score for an item?"

Suppose, for example, that a student eliminates
two alternatives for a four-alternative item. If we
could assume that, when forced to pick one and only one
alternative, the student would randomly pick one of the
two non-eliminated alternatives, then the PC1 score for
the student for the item would be 0.50. However, this
assumption is not necessarily valid; in fact, one could
argue that PC1 might be any value between 0.50 and 1.00.
Thus, it does not appear that the elimination procedure
provides an adequate basis for estimating a student's
PC1 score for an item. Consequently, if the student
were administered the item a large number of times,
we would not have a very good basis for estimating the
number, or proportion, of times the student would get the
item correct under the classical scoring procedure. If the
item is administered K times, this number should be
K·PC1.



Confidence procedure. In confidence testing, one
obtains from each student a subjective probability that
each alternative of a test item is correct. There are
a number of techniques that can be used to obtain these
probabilities either directly or indirectly. This author
prefers the technique usually called the "star" method
in which a student is told to distribute a fixed number
of "stars" or points over the alternatives of a test
item. For example, students might be told to distribute
twelve points over the alternatives of a four-alternative
item. The table below indicates some of the ways students
might perform this task and the associated (subjective)
probabilities.

[Table: example allocations of twelve points over four
alternatives, with the associated subjective probabilities
and PC1 scores. The table is largely illegible in this copy;
the surviving entries are .42, .08, .08, .50 in one row and
.42, .17, .33, .08, 1.00 in the row labeled 5.]



The reader interested in a more in-depth discussion
of confidence testing can consult de Finetti (1965),
Echternacht (1972), Savage (1971), and Shuford et al. (1966).
A great deal of the literature on confidence testing
involves discussion of various procedures for scoring such
tests, but this is not our concern in this chapter.

Appendix A to this report is a manual for DEC-TEST,
a computer program that analyzes confidence test data
in great detail. Further, the introduction to this manual
provides a description of confidence testing and elimination
testing as these procedures are typically used.

Here we are concerned about the nature of the data
(i.e., the probabilities) collected for each item and
for each student.
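
The conversion from a "star" allocation to subjective
probabilities can be sketched as follows. This is a minimal
illustration, assuming the twelve-point version of the method
described above; the function name and the example allocation
are hypothetical.

```python
def stars_to_probabilities(stars, total=12):
    """Convert one student's point allocation over the alternatives
    of an item into subjective probabilities (points / total)."""
    if sum(stars) != total:
        raise ValueError("points must sum to the fixed total")
    return [s / total for s in stars]


# Hypothetical allocation of twelve points over four alternatives.
probs = stars_to_probabilities([5, 1, 0, 6])
print([round(p, 2) for p in probs])  # [0.42, 0.08, 0.0, 0.5]
```

Because the point total is fixed, the probabilities for an item
always sum to one.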

Each probability indicates how confident the
student is that the particular alternative is the correct
answer for the item. Using these probabilities we can
obtain PC1 scores from the following rules:

Let M    = the magnitude of the highest probability
           for a particular student for a given item,

    A    = the number of alternatives for the item,

    P(a) = the probability associated with alternative
           a (a = 1, 2, ..., A), and

    *    = the correct alternative.

Now,

PC1 = 0    if P(*) < M;

PC1 = 1/K  if P(*) = M and there are (K-1) other
           alternatives having P(a) = M; and

PC1 = 1    if P(*) = M and there are no other
           alternatives having P(a) = M.

See the table above for examples of PC1
scores. Note, in particular, that the third and fifth
students both have PC1 = 0.50 even though M = 0.50 for
the third student and M = 0.42 for the fifth student.
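
The three rules above can be sketched in a few lines of code.
This is only an illustration under stated assumptions: the
function name is hypothetical, alternatives are indexed from
zero, and a small tolerance (not part of the rules as stated)
is used when comparing probabilities for equality. The example
rows are taken from the synthetic data of Table 6-1, where the
first alternative is correct.

```python
def pc1_score(probs, correct, tol=1e-9):
    """PC1 score for one student on one item.

    probs   -- subjective probabilities, one per alternative
    correct -- index of the correct alternative
    tol     -- tolerance for treating probabilities as tied (assumption)
    """
    m = max(probs)                       # M, the highest probability
    if probs[correct] < m - tol:         # correct answer is not a top choice
        return 0.0
    # K = number of alternatives tied at M (including the correct one)
    k = sum(1 for p in probs if abs(p - m) <= tol)
    return 1.0 / k


examples = [
    (0.40, 0.40, 0.10, 0.10),   # two-way tie including the correct answer
    (1.00, 0.00, 0.00, 0.00),   # complete confidence in the correct answer
    (0.30, 0.30, 0.10, 0.30),   # three-way tie including the correct answer
    (0.20, 0.70, 0.00, 0.10),   # a distractor is preferred
]
for row in examples:
    print(row, round(pc1_score(row, correct=0), 2))
```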

Thus, PC1 scores are readily available from the
subjective probabilities one obtains using the
confidence testing procedure. Furthermore, when one uses
confidence testing as a procedure to collect data for
items, one obtains, for each student, a probability
associated with each alternative for each item. Thus, one
has a great deal of information for each item, much
more information than if students pick one alternative
or eliminate alternatives.

In short, the confidence procedure seems to be
superior to the elimination procedure, at least for our
purposes here.

Item Analysis Tables from the Confidence Procedure

Consider the synthetic data for a hypothetical
item presented in Table 6-1. The item has four alternatives,
"a" is the correct answer, and the twenty students are
partitioned into lower and upper groups of ten students
each. The confidence probabilities are indicated for
each alternative and for each student. We emphasize that
these are synthetic data, and they are not necessarily
indicative of a good criterion-referenced test item;
we use these data merely to illustrate our discussion.

For each confidence probability in Table 6-1, there
is a pseudo-classical score. A pseudo-classical score
for an alternative is defined as the probability that a
student would pick the alternative if the student were
forced to choose one and only one alternative for the
item under consideration. Thus, the pseudo-classical
score for an item is the pseudo-classical score for the
correct alternative; also, the pseudo-classical score for
an item is identical to the PC1 score discussed previously.

Using the data in Table 6-1, one can construct the
item analysis tables given by Tables 6-2 and 6-3, where
Table 6-2 uses confidence probabilities and Table 6-3 uses
pseudo-classical scores. Both tables present frequency
distributions of scores on alternatives, with associated
totals, means, and standard deviations. Clearly, Table
6-2 provides more information, and a somewhat different
kind of information, than Table 6-3; and both tables
provide much more information than is available from item
analysis tables based upon the classical correct/incorrect
scoring procedure. This additional information can be
quite useful in deciding what (if anything) is wrong with
a criterion-referenced test item.
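
As a sketch of how confidence probabilities translate into
pseudo-classical scores for every alternative, the code below
applies the tie-splitting rule to the ten lower-group students of
Table 6-1 and then averages each column. The function name is
hypothetical, and the equality tolerance is an assumption that
is not part of the definition above.

```python
def pseudo_classical(probs, tol=1e-9):
    """Pseudo-classical score for each alternative of one item:
    alternatives tied at the highest probability share a total
    probability of one equally; all other alternatives get zero."""
    m = max(probs)
    tied = [abs(p - m) <= tol for p in probs]
    k = sum(tied)
    return [1.0 / k if t else 0.0 for t in tied]


# Confidence probabilities for students 1-10 (lower group) of Table 6-1.
lower_group = [
    (0.25, 0.25, 0.25, 0.25), (0.25, 0.25, 0.25, 0.25),
    (0.40, 0.40, 0.10, 0.10), (1.00, 0.00, 0.00, 0.00),
    (0.30, 0.20, 0.30, 0.20), (0.50, 0.50, 0.00, 0.00),
    (0.30, 0.30, 0.10, 0.30), (0.20, 0.70, 0.00, 0.10),
    (0.40, 0.20, 0.00, 0.40), (0.00, 1.00, 0.00, 0.00),
]

scores = [pseudo_classical(row) for row in lower_group]
# Column means correspond to the Mean-L row for pseudo-classical scores.
means = [sum(col) / len(scores) for col in zip(*scores)]
print([round(m, 2) for m in means])
```

Each row of pseudo-classical scores sums to one, so the column
means can be read as the expected proportions of the group that
would pick each alternative under the classical procedure.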

Now, let us summarize a few points implicit in our
discussion thus far. We are assuming that once an item
is validated it probably will be administered using the
classical correct/incorrect scoring procedure. However,
in order to validate the item we are suggesting that the
evaluator collect confidence probabilities for each
alternative, translate these probabilities to
pseudo-classical scores for each alternative, and generate the
pseudo-classical item analysis table. This table indicates
the probability that each student would pick each
alternative using the classical correct/incorrect scoring
procedure; thus, using this table one can analyze the
probable effect of guessing upon the performance of other
similar students who take the item using the classical
procedure for item administration and scoring. Further,
if one wants a detailed display of the certainty with
which students choose any alternative, one can generate
the item analysis table based upon the confidence
probabilities.

                              TABLE 6-1
                            Synthetic Data

Student      Confidence Probabilities        Pseudo-classical Scores (a)
No.              A      B      C      D          A      B      C      D

Lower group
1              .25    .25    .25    .25        .25    .25    .25    .25
2              .25    .25    .25    .25        .25    .25    .25    .25
3              .40    .40    .10    .10        .50    .50    .00    .00
4             1.00    .00    .00    .00       1.00    .00    .00    .00
5              .30    .20    .30    .20        .50    .00    .50    .00
6              .50    .50    .00    .00        .50    .50    .00    .00
7              .30    .30    .10    .30        .33    .33    .00    .33
8              .20    .70    .00    .10        .00   1.00    .00    .00
9              .40    .20    .00    .40        .50    .00    .00    .50
10             .00   1.00    .00    .00        .00   1.00    .00    .00

Sum-L (b)     3.60   3.80   1.00   1.60       3.83   3.83   1.00   1.33
Mean-L         .36    .38    .10    .16        .38    .38    .10    .13
SD-L           .25    .27    .12    .13        .28    .45    .17    .18

Upper group
11             .25    .25    .25    .25        .25    .25    .25    .25
12            1.00    .00    .00    .00       1.00    .00    .00    .00
13            1.00    .00    .00    .00       1.00    .00    .00    .00
14             .70    .20    .00    .10       1.00    .00    .00    .00
15             .60    .00    .20    .20       1.00    .00    .00    .00
16             .50    .50    .00    .00        .50    .50    .00    .00
17             .40    .50    .00    .10        .00   1.00    .00    .00
18             .50    .50    .00    .00        .50    .50    .00    .00
19             .80    .10    .10    .00       1.00    .00    .00    .00
20             .30    .30    .30    .10        .33    .33    .33    .00

Sum-U         6.05   2.35    .85    .75       6.58   2.58    .58    .25
Mean-U         .61    .24    .09    .08        .66    .26    .06    .03
SD-U           .25    .20    .11    .09        .37    .32    .12    .08

Sum-T         9.65   6.15   1.85   2.35      10.41   6.41   1.58   1.58
Mean-T         .48    .31    .09    .12        .52    .32    .08    .08
SD-T           .28    .25    .12    .12        .12    .34    .15    .15

(a) A pseudo-classical score for an alternative represents
the probability that a student would pick the
alternative if the student were forced to pick one and
only one alternative for the test item.

(b) L, U, and T mean the lower, upper, and total groups,
respectively.

Admittedly, the ideas discussed above require
detailed procedures for item administration, scoring,
and analysis; however, the additional time and effort
required can, I think, be very worthwhile for the process
of validating items.

An Application of PC1 Scores in the Classical Test
Theory Model



Recall that under the classical test theory model
X = T + E, where X, T, and E are observed, true, and
random error scores, respectively. Now, we have
described the PC1 item score for a student as a kind of
guessing-free estimate of a person's classical score,
and guessing is usually interpreted as one kind of
random error. If we assume that guessing is the only,
or the principal, kind of random error that concerns
us, then a PC1 score is a kind of true score and we
can analyze the effect of guessing upon classical scores
by using the classical test theory model directly. Thus,
in this section we will let

X = 0 or 1 (classical observed score),

T = PC1 item score, and

E = random error due to guessing.

Basic statistics. Note that when one typically uses
the classical test theory model, one has observed scores,
and one wants to estimate true scores; however, in this
case, we already have the true scores, and we must
estimate the observed scores. Now, if the item were
administered to student i a total of K times we would expect
student i to get the item correct K·T_i times, and we would
expect student i to get the item incorrect K·(1 - T_i)
times. Therefore, if N is the total number of subjects,

\[
\bar{X} = \frac{1}{KN}\sum_{i=1}^{N} K\,T_i = \bar{T}
\tag{6.1}
\]

and, since every observed score is either 0 or 1,

\[
s_X^2 = \bar{T}\,(1 - \bar{T}) .
\tag{6.2}
\]

For an example of these statistics see Table 6-4, which
uses the synthetic data presented in Table 6-1 and
assumes, for the sake of illustration, that K = 12.

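Table 6-4 also indicates the error scores associated
with each observed score for our synthetic data. The
mean and variance of the error scores are given by:

\[
\begin{aligned}
\bar{E} &= \frac{1}{KN}\sum_{j=1}^{K}\sum_{i=1}^{N}\left(X_{ij} - T_{ij}\right) \\
        &= \frac{1}{KN}\sum_{i=1}^{N} K\,T_i \;-\; \frac{1}{N}\sum_{i=1}^{N} T_i \\
        &= \bar{T} - \bar{T} \\
        &= 0
\end{aligned}
\tag{6.3}
\]

and

\[
\begin{aligned}
s_E^2 &= \frac{1}{KN}\sum_{j=1}^{K}\sum_{i=1}^{N}\left(X_{ij} - T_{ij}\right)^2 \\
      &= \frac{1}{N}\sum_{i=1}^{N}\left[\frac{1}{K}\sum_{j=1}^{K}\left(X_{ij} - T_{ij}\right)^2\right] \\
      &= \frac{1}{N}\sum_{i=1}^{N}\left[\frac{1}{K}\sum_{j=1}^{K}X_{ij}^2
         - \frac{2}{K}\sum_{j=1}^{K}X_{ij}T_{ij}
         + \frac{1}{K}\sum_{j=1}^{K}T_{ij}^2\right] \\
      &= \frac{1}{N}\sum_{i=1}^{N}\left[T_i - \frac{2}{K}\left(K\,T_i^2\right)
         + \frac{1}{K}\left(K\,T_i^2\right)\right] \\
      &= \frac{1}{N}\sum_{i=1}^{N}\left(T_i - T_i^2\right) \\
      &= \frac{1}{N}\sum_{i=1}^{N} T_i\,(1 - T_i) ,
\end{aligned}
\tag{6.4}
\]

where T_ij = T_i for every administration j.

Now, let us demonstrate that s_X^2 = s_T^2 + s_E^2:

\[
\begin{aligned}
s_T^2 + s_E^2 &= \left[\frac{1}{N}\sum_{i=1}^{N} T_i^2 - \bar{T}^2\right]
               + \left[\frac{1}{N}\sum_{i=1}^{N} T_i\,(1 - T_i)\right] \\
              &= -\bar{T}^2 + \frac{1}{N}\sum_{i=1}^{N} T_i \\
              &= \bar{T}\,(1 - \bar{T}) \\
              &= s_X^2 .
\end{aligned}
\]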
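
The decomposition can be checked numerically. The sketch below
is only an illustration: it takes the twenty PC1 scores for the
correct alternative from Table 6-1 (entered as printed, so that
one third appears as .33) and evaluates (6.1), (6.2), and (6.4).
The variable names are hypothetical.

```python
# PC1 scores (true scores T) for the twenty students of Table 6-1.
T = [0.25, 0.25, 0.50, 1.00, 0.50, 0.50, 0.33, 0.00, 0.50, 0.00,
     0.25, 1.00, 1.00, 1.00, 1.00, 0.50, 0.00, 0.50, 1.00, 0.33]
N = len(T)

t_bar = sum(T) / N                               # mean of T, and of X by (6.1)
var_x = t_bar * (1.0 - t_bar)                    # observed-score variance (6.2)
var_t = sum(t * t for t in T) / N - t_bar ** 2   # true-score variance
var_e = sum(t * (1.0 - t) for t in T) / N        # error variance (6.4)

print(round(t_bar, 4), round(var_x, 4), round(var_t, 4), round(var_e, 4))
# The two sides of the decomposition agree up to rounding error.
print(abs(var_x - (var_t + var_e)) < 1e-12)
```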



Thus, we have demonstrated that, by interpreting
our PC1 scores as true scores, we can express the mean and
variance of observed scores in terms of the true scores.
Furthermore, we have shown that the variance of the
observed scores does indeed equal the variance of the
true scores plus the variance of the error scores.
The mean and variance of the observed, true, and error
scores are provided in Table 6-4. For reference now
and later, the reader should note that, for our synthetic
data,

\[
\sum_{i=1}^{20} T_i = 10.41 , \qquad
\sum_{i=1}^{20} T_i^2 = 7.9053 , \qquad \text{and} \qquad
\sum_{i=1}^{20} T_i^3 = 6.8687 .
\]

Reliability of a one-item test. Using the above
results, we can express the reliability of a one-item
test as:

\[
r_{11} = \frac{s_T^2}{s_X^2}
       = \frac{\dfrac{1}{N}\sum T_i^2 - \bar{T}^2}{\bar{T}\,(1 - \bar{T})}
       = \frac{\sum T_i^2 - N\,\bar{T}^2}{\sum T_i - N\,\bar{T}^2} .
\]

For our synthetic data,

\[
r_{11} = \frac{0.124}{0.249} = 0.498 .
\]

The reader should keep in mind that r_11 is the
proportion of variance in observed scores not due to
guessing, whereas (1 - r_11) is the proportion of variance
in observed scores due to guessing. Now, we call r_11
the reliability of a one-item test; however, if there
are random errors operating other than those due to guessing,
then r_11 will be an upper limit to the "true" reliability
of the item.

In order to estimate the reliability of a test
consisting of K replications of the item, we can use the
Spearman-Brown Prophecy Formula

\[
r_{KK} = \frac{K\,r_{11}}{1 + (K - 1)\,r_{11}} .
\tag{6.6}
\]
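
A short sketch of these two computations follows, using the
sums reported above for the synthetic data. It assumes the
standard form of the Spearman-Brown formula, with a plus sign in
the denominator; the function names are hypothetical.

```python
def one_item_reliability(n, sum_t, sum_t2):
    """r11 = s_T^2 / s_X^2, written in terms of N, sum(T), sum(T^2)."""
    t_bar = sum_t / n
    return (sum_t2 - n * t_bar ** 2) / (sum_t - n * t_bar ** 2)


def spearman_brown(r11, k):
    """Reliability of a test made of k replications of the item (6.6)."""
    return k * r11 / (1.0 + (k - 1) * r11)


r11 = one_item_reliability(20, 10.41, 7.9053)
print(round(r11, 3))                      # 0.498
print(round(spearman_brown(r11, 12), 3))  # reliability for K = 12
```

Setting k = 1 returns r11 itself, which is a quick check that
the formula has been entered correctly.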



Another way to view the reliability of a one-item
test is to ask how many items of a similar nature would
have to be administered in order to obtain a given level
of reliability. This question can be answered by
re-arranging the terms in the Spearman-Brown Prophecy
Formula in order to get

\[
K = \frac{r_{KK}\,(1 - r_{11})}{r_{11}\,(1 - r_{KK})} ,
\tag{6.7}
\]

where, in this case, r_KK is the level of reliability
desired and K is the number of items necessary to
achieve this level of reliability. Using our synthetic
data, if we set r_KK = 0.90, then

\[
K = \frac{0.90\,(1 - 0.498)}{0.498\,(1 - 0.90)} = 9.072 .
\]

One further statistic, of a reliability nature, may
be of interest. It can be shown that the probability that
a randomly selected student would maintain his or her
observed score on L = 2 or 3 administrations of the same
item is:

\[
P_L = 1 - L\,s_E^2 .
\tag{6.8}
\]

For our synthetic data,

P_2 = 1 - 2(0.125) = 0.750
and P_3 = 1 - 3(0.125) = 0.625 .
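
Equations (6.7) and (6.8) are simple enough to evaluate by hand,
but a small sketch makes the arithmetic easy to repeat for other
items. The function names are hypothetical; the inputs are the
rounded values quoted above for the synthetic data.

```python
def items_needed(r11, r_kk):
    """Number of similar items needed to reach reliability r_kk (6.7)."""
    return r_kk * (1.0 - r11) / (r11 * (1.0 - r_kk))


def prob_same_score(var_e, administrations):
    """Probability that a randomly selected student keeps the same
    observed score over 2 or 3 administrations of the item (6.8)."""
    if administrations not in (2, 3):
        raise ValueError("the result is stated only for L = 2 or 3")
    return 1.0 - administrations * var_e


print(round(items_needed(0.498, 0.90), 3))   # 9.072
print(prob_same_score(0.125, 2))             # 0.75
print(prob_same_score(0.125, 3))             # 0.625
```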



Regression of observed scores on true scores. The
standard error of measurement is the square root of the
expression in (6.4), which is also equal to

\[
s_E = s_X\sqrt{1 - r_{11}} .
\tag{6.9}
\]

For our synthetic data,

\[
s_E = \sqrt{0.125} = 0.354
\qquad \text{or} \qquad
s_E = \sqrt{0.249}\,\sqrt{1 - 0.498} = 0.354 .
\]

The reader should recall that the standard error of
measurement is associated with the regression of observed
scores on true scores, as indicated, for our synthetic
data, in Figure 6-1. This regression is used to predict
observed scores from true scores. As such, this
regression can be used to establish a confidence interval around
the expected difficulty level of the item, where
difficulty level is based on the classical scoring procedure
and is merely the proportion of subjects who get an item
correct.

Regression of true scores on observed scores. The
other regression of interest is the regression of true
scores on observed scores. From classical test theory,
this regression is:

\[
\hat{T} = \bar{T}\,(1 - r_{11}) + r_{11}\,X ,
\tag{6.10}
\]

where \hat{T} is the estimated value of T assuming a linear
regression of true on observed scores. The standard
deviation of errors about this regression is called the
standard error of estimate and denoted s_est. For the
kind of data considered here, it can be shown that

\[
s_{est}^2 = \left[\frac{\sum T - \sum T^2}{N}\right]
            \left[\frac{N\sum T^2 - \left(\sum T\right)^2}
                       {N\sum T - \left(\sum T\right)^2}\right]
          = s_E^2\, r_{11} .
\tag{6.11}
\]

Now, since there are only two possible observed scores
for an item (0 and 1), it is also true that

\[
s_{est}^2 = w_0\, s_{est(0)}^2 + w_1\, s_{est(1)}^2 ,
\tag{6.12}
\]

where

s_est(0)^2 = variance of the error scores about the
regression line when X = 0

\[
s_{est(0)}^2 = \frac{\sum T^2 - \sum T^3}{N - \sum T}
             - \left[\frac{\sum T - \sum T^2}{N - \sum T}\right]^2 ,
\tag{6.13}
\]

\[
w_0 = 1 - \bar{T} ,
\tag{6.14}
\]

s_est(1)^2 = variance of the error scores about the
regression line when X = 1

\[
s_{est(1)}^2 = \frac{\sum T^3}{\sum T}
             - \left[\frac{\sum T^2}{\sum T}\right]^2 ,
\quad \text{and}
\tag{6.15}
\]

\[
w_1 = \bar{T} .
\tag{6.16}
\]

Figure 6-2 provides, for our synthetic data, the
regressions of true scores on observed scores, as well as the
values of the statistics indicated in (6.11), (6.13), and
(6.15).
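
The sketch below evaluates (6.11) and (6.13) through (6.16) from
the three sums reported earlier for the synthetic data, and
confirms that the weighted combination in (6.12) reproduces
(6.11). Variable names are hypothetical, and the conditional
variance formulas are used here in the form reconstructed above.

```python
# N and the sums of T, T^2, and T^3 for the synthetic data.
N, s1, s2, s3 = 20, 10.41, 7.9053, 6.8687

var_e = (s1 - s2) / N                               # error variance (6.4)
r11 = (N * s2 - s1 ** 2) / (N * s1 - s1 ** 2)       # one-item reliability
var_est = var_e * r11                               # (6.11)

# Variance of T about its conditional mean when X = 0 and when X = 1.
var_est0 = (s2 - s3) / (N - s1) - ((s1 - s2) / (N - s1)) ** 2   # (6.13)
w0 = 1.0 - s1 / N                                               # (6.14)
var_est1 = s3 / s1 - (s2 / s1) ** 2                             # (6.15)
w1 = s1 / N                                                     # (6.16)

print(round(var_est, 3), round(var_est0, 3), round(var_est1, 3))
# (6.12): the weighted conditional variances recover (6.11).
print(abs(var_est - (w0 * var_est0 + w1 * var_est1)) < 1e-9)
```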
FIGURE 6-2

Regression of True Scores on Observed Scores

[Figure: true scores (vertical axis) plotted against
observed scores (horizontal axis). Legible values:
s_est^2 = 0.062 and s_est(0)^2 = 0.040; the value of
s_est(1)^2 is not legible in this copy.]
CHAPTER VII



Data Analysis



In this chapter we define, present, and discuss
a set of data that were collected in order to illustrate
some of the issues, statistics, and procedures
considered in previous chapters. The data reported should
not be considered as necessarily indicative of either
"good" or "bad" criterion-referenced tests or items.



Design for Data Collection

In the fall of 1972 and the spring of 1973 two
forms (A and B) of a 25-item criterion-referenced test
for a course in educational measurement were
administered in both the pre- and posttest mode to 113
students.

In order to understand the design used for
administering these tests, the reader will find it useful
to refer to the format of Table 7-1a. In this table
(and other tables to be discussed in this chapter)
the following notation is used:

Factor    Level             Description

A         a1                test administered using SCoRule
A         a2                test administered using "star" technique
B         b1, b2, b3, b4    blocks of subjects
C         c1                Form A of test
C         c2                Form B of test
D         d1                Pretest
D         d2                Posttest

(For Factors A and B, see footnote 2, below.)

Also, note that a dot in place of a subscript indicates
the mean over all levels of the factor being considered.



1 All tables referenced in this chapter can be found
at the end of the chapter.

2 Factors A and B should not be confused with forms
A and B of the Pretest and the Posttest.

The reader should note several important facts
about this design:

(a) If we collapse the levels of the A factor,
we see that subjects in the first block received
Pretest A and Posttest A, subjects in the second block
received Pretest A and Posttest B, subjects in the
third block received Pretest B and Posttest A, and
subjects in the fourth block received Pretest B and
Posttest B. Furthermore, note that subjects were
randomly assigned to blocks.

(b) The discussion above indicates that the design
is a (balanced) repeated measures design in which half
of the available cells are empty; i.e., each subject
took one form of the Pretest and one form of the Posttest,
and, thus, no subject took both forms of either the
Pretest or the Posttest. In the opinion of this author,
the constraints incorporated in the design are realistic
in that it is often not feasible to obtain repeated
measures for equivalent tests in the real world of
course development and evaluation.

(c) Although the constraint mentioned above is
realistic, it is, nevertheless, somewhat restricting.
For example, we cannot obtain direct measures of the
equivalence of the two forms of the Pre- and Posttests.
Also, when we examine summary statistics for tests and
items, these statistics sometimes will be based upon
different or partially overlapping samples of subjects.

The actual items administered to subjects are
provided in Appendix B (see footnote 1, below). All
items are four-alternative objective items which had not
been subjected to any previous validation or revision
procedures. Therefore, these items are not necessarily
"good" items. In fact, one of the purposes of this
chapter is to illustrate a procedure discussed in
Chapter V that might be used to collect data, report
statistics, and identify items that may require revision.
All test and item data were analyzed using DEC-TEST,
which is described in Appendix A, and SPSS.

1 Note that Forms A and B of the Posttest actually
contained 50 items; however, items 26-50 (identified
as ZC26 to ZC50 in Appendix B) were the same items in
both forms, and none of these items was intended to be
equivalent to any item numbered 1 to 25. Therefore, for
the purposes of this chapter, we shall treat only items
1 to 25.

Another important aspect of the data collection
procedure involves the way in which students responded
to test items. For each item, each student identified
the alternative he or she would pick if forced to pick
one and only one alternative; also, each student
indirectly reported his or her subjective probabilities
for each alternative for each item. Subjects in level a1
reported actual log scores (range of 0 to 100)
for each alternative using a mechanical device called
a SCoRule; these log scores were later transformed into
subjective probabilities using a formula provided in
Appendix A (see p. A-28). Students in level a2 used the
twelve-point "star" system for reporting their
subjective probabilities (see Chapter VI and/or p. A-27).
The reader unfamiliar with confidence testing, the
logarithmic scoring system, subjective probabilities, and/
or the "star" system would be well-advised to study
pages 6-1 to 6-10, and the first section of Appendix A.

Summary Statistics for Subjects and Tests

The procedure whereby subjects responded to items
may be summarized by saying that subjects did two
things: they picked one alternative and they indirectly
reported subjective probabilities. The "pick one"
procedure allows us to calculate a classical correct/wrong
(1 or 0) item score for each subject, while the
"subjective probability" procedure (typically considered
in conjunction with confidence testing, admissible
probability measurement, or decision-theoretic testing)
allows us to calculate or estimate a number of different
item scores for each subject. (See Section I of Appendix
A, especially pages A-11 to A-14.)

Tables 7-1a,b,c to 7-6a,b,c report means and
standard deviations over tests and persons for six
different types of subject scores. In these and other
tables, the different scores for a subject are identified
as:

VAR(1) = Arithmetic mean of item confidence scores;
         i.e., each subject's score is the arithmetic
         mean of the subjective probabilities
         associated with the correct answer to each
         item. (Range = 0 to 1.)

VAR(2) = Geometric mean of item confidence scores;
         i.e., each subject's score is the geometric
         mean of the subjective probabilities
         associated with the correct answer to each
         item. See Appendix A, p. A-3, for formulas.
         (Range = 0 to 1.)

VAR(3) = Arithmetic mean of item log scores; i.e.,
         each subject's score is the arithmetic mean
         of the log scores associated with the correct
         answer to each item. (Range = 0 to 100.)

VAR(4) = Arithmetic mean of item elimination scores,
         which are estimated from the subject's
         subjective probabilities using a procedure
         described in Chapter VI, p. 6-3, and
         Appendix A, pp. A-12 to A-14.
         (Range = -1 to 1.)

VAR(5) = Arithmetic mean of item pseudo-classical
         scores, which are estimated from the
         subject's subjective probabilities using
         the procedure described in Chapter VI,
         pp. 6-4 to 6-5, and Appendix A, p. A-14.
         (Range = 0 to 1.)

VAR(6) = Arithmetic mean of classical item scores,
         which are determined directly from the
         "pick one" procedure. (Range = 0 to 1.)

Table 7-7 reports means, standard deviations, and
reliabilities for each of the four tests and for each
of the six different kinds of subject scores. The
reader should note that we report these reliabilities
mainly for the sake of completeness. We do not claim
that any of these tests consists of a homogeneous set of
items, which is a logical prerequisite to a meaningful
internal consistency reliability.
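
Four of the six subject scores defined above can be computed
directly from a subject's subjective probabilities and "pick
one" choices, as the sketch below suggests. It is an
illustration only: the function names and the three-item data
are hypothetical, VAR(2) is computed as a plain geometric mean
(the formulas in Appendix A may treat special cases such as zero
probabilities differently), and VAR(3) and VAR(4) are omitted
because their formulas are given only in Appendix A.

```python
import math


def pc1(probs, correct, tol=1e-9):
    """Pseudo-classical (PC1) score for one item."""
    m = max(probs)
    if probs[correct] < m - tol:
        return 0.0
    return 1.0 / sum(1 for p in probs if abs(p - m) <= tol)


def subject_scores(item_probs, correct, picks):
    """VAR(1), VAR(2), VAR(5), and VAR(6) for one subject.

    item_probs -- one tuple of subjective probabilities per item
    correct    -- index of the correct alternative for each item
    picks      -- index of the alternative picked for each item
    """
    n = len(item_probs)
    p_correct = [probs[c] for probs, c in zip(item_probs, correct)]
    return {
        "VAR(1)": sum(p_correct) / n,
        "VAR(2)": math.prod(p_correct) ** (1.0 / n),
        "VAR(5)": sum(pc1(p, c) for p, c in zip(item_probs, correct)) / n,
        "VAR(6)": sum(1 for k, c in zip(picks, correct) if k == c) / n,
    }


# Hypothetical subject answering three four-alternative items.
probs = [(0.50, 0.50, 0.00, 0.00), (1.00, 0.00, 0.00, 0.00),
         (0.25, 0.25, 0.25, 0.25)]
scores = subject_scores(probs, correct=[0, 0, 0], picks=[0, 0, 2])
for name, value in scores.items():
    print(name, round(value, 3))
```

In this made-up case the classical score VAR(6) depends on how
the subject happened to resolve two ties, whereas VAR(5) does
not, which is the sense in which pseudo-classical scores are
less affected by guessing.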

Tables 7-1 to 7-7 are presented for the reader who 
is interested in comparing the six different types 
of scores discussed above. For our purposes, in this 
chapter, we will concentrate primarily upon pseudo- 
classical scores. Recall that pseudo-classical scores 
are estimated classical scores which are determined from the 
subjective probabilities assigned by subjects to the 
alternatives of test items. As indicated previously,
pseudo-classical scores are much less affected by
guessing than are classical scores; one can directly
determine a kind of item reliability from pseudo-classical
item scores; and pseudo-classical scores are easily
interpreted. Pseudo-classical scores, in fact, appear
to have most of the advantages and few of the disadvantages
of both classical scores and subjective probabilities.
In short, in the opinion of this author, pseudo-classical
scores have considerable promise as a basis for 
validating criterion-referenced, mastery, and possibly 
norm-referenced test items. It should be noted that once 
an item has been validated using pseudo-classical scores, 
one can logically consider subsequently administering 
and scoring the validated item using classical procedures;
however, it is somewhat more difficult to justify
validating an item using log scores, subjective probabilities,
or elimination scores, and then subsequently
administering and scoring the item using classical 
procedures. 

Test means and standard deviations using pseudo-classical
scores are presented in Tables 7-5a,b,c.
Note that Tables 7-5b and 7-5c are primarily different 
ways of displaying the data in the cells of Table 7-5a. 
Let us now consider three hypotheses for both the 
Pretest (Table 7-5b) and the Posttest (Table 7-5c) 
aspects of these data: 

(a) There are no differences among means for those
subjects who used the SCoRule (level a1) versus those
subjects who used the star technique (level a2)
for recording responses to items.

(b) The two forms of the test have equal means. 

(c) There are no differences among the means for 
subjects in each of the four blocks. Recall that 
subjects were randomly assigned to blocks, and, therefore,
we would not expect to find any such differences.
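
The marginal means compared by these hypotheses can be recovered from the cell means and cell sizes. The sketch below, which assumes simple N-weighted averaging and uses the Pretest pseudo-classical cell values reported in Table 7-5a, illustrates the arithmetic; the dictionary layout and function name are our own illustrative choices.

```python
# Pretest pseudo-classical cell means and cell sizes, copied from Table 7-5a.
# Keys are (level of A, block); values are (mean, N).
cells = {
    ("a1", "b1"): (0.360, 21), ("a1", "b2"): (0.378, 19),
    ("a1", "b3"): (0.367, 17), ("a1", "b4"): (0.357, 20),
    ("a2", "b1"): (0.424, 10), ("a2", "b2"): (0.403, 9),
    ("a2", "b3"): (0.386, 9),  ("a2", "b4"): (0.427, 8),
}

def marginal_mean(cells, position, label):
    """N-weighted mean over all cells whose key matches `label`
    at `position` (0 = level of A, 1 = block)."""
    selected = [(m, n) for key, (m, n) in cells.items() if key[position] == label]
    total_n = sum(n for _, n in selected)
    return sum(m * n for m, n in selected) / total_n

a1_mean = marginal_mean(cells, 0, "a1")  # SCoRule, all blocks
a2_mean = marginal_mean(cells, 0, "a2")  # star technique, all blocks
print(round(a1_mean, 3), round(a2_mean, 3))
```

The two values agree, to three decimals, with the "Both" column of Table 7-5b.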

The results of testing these hypotheses are indicated 
in Tables 7-8 and 7-9, which are based upon the data in 
Tables 7-5b and 7-5c, respectively. (See footnote 1, 
below.) In both Tables 7-8 and 7-9, the first six 
contrasts are related to the first hypothesis, the
seventh contrast is related to the second hypothesis,
and the last two contrasts are related to the third
hypothesis. All contrasts were defined a priori.



In Tables 7-8 and 7-9, the columns labelled "orth 
t" and "Bonf t" provide an indication of significance 
levels for multiple comparisons using the orthogonal t- 
test procedure and the Bonferroni t-test procedure, 
respectively. (The latter is also called Dunn's proce- 
dure.) Strictly speaking, for these analyses, the ortho- 
gonal t-test procedure is too liberal in declaring signi- 
ficant differences. The Bonferroni procedure is more 
conservative. 
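
To make the difference between the two procedures concrete, the following sketch shows how the Bonferroni procedure divides a familywise significance level over a set of planned contrasts. It assumes that the nine contrasts in each table form one family and that the familywise level is .05; the p-values shown are hypothetical and are not taken from Tables 7-8 or 7-9.

```python
def bonferroni_alpha(family_alpha, n_comparisons):
    """Per-comparison significance level under the Bonferroni procedure."""
    return family_alpha / n_comparisons

def significant_contrasts(p_values, family_alpha=0.05):
    """Flag each contrast whose p-value falls below the Bonferroni level."""
    per_comparison = bonferroni_alpha(family_alpha, len(p_values))
    return [p < per_comparison for p in p_values]

# Assumption: nine a priori contrasts tested as one family.
per_contrast = bonferroni_alpha(0.05, 9)

# Hypothetical p-values for three contrasts (illustration only).
flags = significant_contrasts([0.001, 0.02, 0.04])
print(round(per_contrast, 4), flags)
```

A contrast that would be declared significant at .05 when tested alone can therefore fail the Bonferroni criterion, which is the sense in which the latter procedure is more conservative.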

Let us now examine what the data reveal about
each hypothesis: 

(a) There is a significant main effect for Factor A
on the Pretest but not on the Posttest. One half of the
Pretest contrasts that compare levels of A are significantly
different from zero using the orthogonal t-test
procedure for multiple comparisons. For all but one
of the Pretest and Posttest contrasts, there is an
indication that students in level a2 achieved higher
scores than students in level a1. In short, there
is a definite trend for subjects who use the
star technique to achieve higher scores than those who
use the SCoRule, and this trend is more pronounced on
the Pretest than on the Posttest. Probably these results
indicate that students understand the star technique
better than they understand the use of the SCoRule.

(b) Contrast number seven in Tables 7-8 and 7-9 
indicates that the difference between the means for 
Forms A and B, for both the Pretest and the Posttest, is 
not significant. The reader should note, however, that 
differences between forms are confounded with differences 
between blocks. The best we can say is that we have no
direct evidence to reject the hypothesis that forms
are equivalent.

(c) There is a significant main effect for Factor B 
on the Posttest but not on the Pretest. Contrast number 
nine in Table 7-9 indicates that the significant Posttest 
difference is primarily a result of the difference 
between the means for subjects in the second and fourth 
blocks (i.e., subjects who took Posttest B) . Since 
subjects were randomly assigned to blocks, the author has 
no explanation for this result, other than the rather 
obvious statement that random assignment does not 
guarantee equality of means. (Note that Table 7-5c indicates
that block b2 for the Posttest has a considerably
lower mean than any other block for the Posttest,
including blocks associated with Posttest A.)

In the next two sections we will analyze each of
the items that make up both forms of the criterion-
referenced Pretest and Posttest. In these sections we
will continue to emphasize pseudo-classical item scores,
although we will, on occasion, report statistics based
upon subjective probabilities associated with items
and classical item scores.

Item Equivalence



Let us review the nature of each of the tests 
considered here. There are two forms (A and B) of the 
Pretest and two forms (A and B) of the Posttest. 
Pretest A and Posttest A are identical, item by item, 
and the same is true of Pretest B and Posttest B. 
If we let "i" be a generic item number, then item i on
Form A (in both the Pre- and Posttest) is intended to be 
equivalent to item i on Form B (in both the Pre- and 
Posttest). In brief, there are two different tests, 
or sets of items (Form A and Form B) administered at two 
different times (Pretest and Posttest) . Consequently, 
a complete analysis of item equivalence must consider 
the issue of equivalence for each item for both the 
Pretest and Posttest mode. 

If we generalize from classical procedures for 
testing the equivalence of two tests, we would test the 
equivalence of two items in, say, the Posttest mode, by 
administering both items to the same set of subjects 
at the time of the Posttest. Then, if the means and 
standard deviations of the two items were the same, we 
could claim that the two items are equivalent, and the 
correlation between the item scores for the two items 
could be interpreted as a coefficient of equivalence 
for the item. However, the design used to collect our 
data will not permit such a procedure since, as indica- 
ted previously, the same subjects never take both forms 
of an item in either the Pretest or the Posttest mode. 
Also, for this reason, we cannot use Cochran's Q-test
(discussed in Chapter V, p. 5-7) to test item equivalence
when items are scored in the classical correct/wrong
manner.

In short, we cannot obtain a direct measure of item 
equivalence for the two forms of any item given the 
design for data collection employed here. However, since 
subjects were randomly assigned to blocks, and since, 
for the most part, there are no significant differences 
between block means for the Pre- and Posttests, we can 
partially consider the statistical issue of item equiva- 
lence by examining the differences between Form A and 
Form B item means and standard deviations. Tables 7-10 
to 7-12 present the appropriate item statistics when items 
are scored using subjective (confidence) probabilities, 
classical scores, and pseudo-classical scores, respec- 
tively. 

Let us consider Table 7-12, which is based upon
pseudo-classical item scores, in some detail. The means
reported can be interpreted in a manner similar to 
item difficulty levels. The difference between means 
for the two forms of any item is tested using a t-test 
for independent samples. The equivalence of item stan- 
dard deviations is tested using the FMAX statistic, 
which is the ratio of the larger variance divided by the 
smaller variance, and which has an F-distribution. Since
we are performing multiple tests of significance, it is
advisable to distribute the α-level (.05) equally over
all 25 items; thus, it is advisable to consider a difference
or FMAX value to be significant only if p < .002 =
.05/25.
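
The two test statistics just described can be sketched as follows. The sketch assumes the usual pooled-variance form of the t-test for independent samples; the item means, standard deviations, and sample sizes are hypothetical values chosen for illustration and are not taken from Table 7-12. Converting the statistics to p-values would additionally require t- and F-distribution functions, which are not shown here.

```python
import math

def pooled_t(mean1, sd1, n1, mean2, sd2, n2):
    """t statistic for two independent samples, pooled-variance form.
    Degrees of freedom are n1 + n2 - 2."""
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2)
    std_err = math.sqrt(pooled_var * (1.0 / n1 + 1.0 / n2))
    return (mean1 - mean2) / std_err

def fmax(sd1, sd2):
    """Ratio of the larger variance to the smaller variance."""
    var1, var2 = sd1 ** 2, sd2 ** 2
    return max(var1, var2) / min(var1, var2)

# Significance level per item when .05 is spread equally over 25 items.
per_item_alpha = 0.05 / 25

# Hypothetical Form A and Form B statistics for one item.
t_stat = pooled_t(0.60, 0.20, 59, 0.50, 0.25, 54)
f_stat = fmax(0.20, 0.25)
print(round(t_stat, 3), round(f_stat, 4), per_item_alpha)
```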

In addition to comparing means and standard devia- 
tions for the two forms of any item, when we use pseudo- 
classical scores, we can also compare the item relia- 
bilities discussed in Chapter VI. These reliabilities 
are provided in Table 7-13. 

We can summarize the critical information in 
Tables 7-12 and 7-13 in the following manner. 

Item    Pretest Differences in:    Posttest Differences in:

No.     Mn's   SD's   r's          Mn's   SD's   r's

2 X X 

3 X 
7 X 

9 X 

11 XXX 

13 X 

14 X 

15 X 

21 XX 

22 XX 

23 XX 

24 XX 



In the above table, an "x" appears only if p<.002, and 
the items listed are only those for which at least one 
pretest or posttest difference is significant at p<.002. 
Clearly there is some evidence that the two forms of 
some items are not equivalent, for either the pretest 
mode or the posttest mode or both modes. Note that if 
two items are equivalent when administered in the pretest 
mode, this does not guarantee that the items will be 
equivalent when administered in the posttest mode, and 
vice-versa. 

Data for Identifying Items that may Require Revision

In Chapter V the author specified a procedure for
identifying items that may require revision. The basic 
data (or summary statistics) and rules for this procedure 
are summarized in Table 5-1. The results of applying 
this procedure (with some modifications and additions) 
to the items discussed in this Chapter are indicated in
Tables 7-14 to 7-17. The reader should note that for 
each of these Tables: (a) each item was scored using 
the pseudo-classical scoring procedure; (b) pretest and 
posttest item reliabilities are considered as data for 
decision-making, along with the data discussed in 
Chapter V; (c) the Theoretical Error Rate (TER) is 0.75 
for all items, since all items have four alternatives; 
and (d) an "x" indicates that revision may be required 
on the basis of the indicated rule. 
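
The value of 0.75 in point (c) is what one would obtain if the Theoretical Error Rate is taken to be the proportion of wrong answers expected under blind guessing among equally likely alternatives. That reading is an assumption on our part (the formal definition is given in Chapter V); under it, the calculation is simply:

```python
def theoretical_error_rate(n_alternatives):
    """Chance of an incorrect answer when one of n equally likely
    alternatives is picked at random (assumed reading of TER)."""
    if n_alternatives < 2:
        raise ValueError("an item needs at least two alternatives")
    return (n_alternatives - 1) / n_alternatives

ter = theoretical_error_rate(4)  # four-alternative items, as in this study
print(ter)
```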

The reader will note from the title for each of 
the four tables that: (a) the data reported in Table 7-14 
are for the 31 subjects who took Form A for both the 
Pre- and Posttest, (b) the data reported in Table 7-16 
are for the 28 subjects who took Form B for both the 
Pre- and Posttest, and (c) for Tables 7-15 and 7-17 
the sets of subjects who took the Pre- and Posttests are 
not the same, although there is a considerable degree of 
overlap. 

It should be noted that the decision rules specified 
in Tables 7-14 to 7-17 are, in several cases, based upon 
the author's subjective judgments with regard to the 
context within which the items were used. For example, 
there is no "objective" basis for saying that an item 
may need revision if PMPG<.50 — others might argue for 
a cut-off value of, say, 0.40 or 0.60. It is also possi- 
ble that another evaluator examining the same items, 
might choose to add other statistics and/or decision 
rules, or an evaluator might even choose to eliminate 
certain statistics and/or rules. The important issues
are that: (a) the decision rules be specified prior to 
an examination of the data, (b) the actual rules and 
cut-offs chosen have at least a logical basis for being 
stated, and (c) the procedure used for examining item 
data be systematic and, as much as possible, replicable. 

A cursory analysis of Tables 7-14 to 7-17 will 
convince the reader that PMPG is often less than 0.50
and PER is often greater than 0.40. Thus, at a minimum, 
the instruction for the information tested by many of 
these criterion-referenced items has not been as effective 
as the author had hoped. 

The actual task of determining which items to
revise and what kinds of revisions to make involves:
(a) using the item statistics and tests discussed in
the previous section of this Chapter in order to determine
those pairs of items that do not appear to be
equivalent and (b) using statistics and rules of the
kind reported in Tables 7-14 to 7-17 (as well as other 
supporting data such as item analysis tables) in order 
to determine which particular items and what aspects of 
such items require revision. 

At the risk of being repetitious, we wish to state 
again that even if the data indicate that revision may 
be required, one must study the item carefully to 
determine what, if anything, needs revision. For 
example, there appear to be problems with both forms 
of item 21; yet, after analyzing the data, the item 
analysis tables for the two forms of the item, and the 
actual items themselves, no obvious problem with either 
item was apparent. Therefore, the author intends to 
retest both forms of item 21 at some future time, and if 
the same situation still prevails, then the author will 
eliminate or completely rewrite both items. 

TABLE 7-1a

Means and Standard Deviations -- Pretest and Posttest
VAR(1) = Arithmetic Mean of Item Confidence Scores

             Pretest           Posttest
Cell         Fm A     Fm B     Fm A     Fm B     N

a1b1   Mean  .312              .555              21
       SD    .060              .142

a2b1   Mean  .373              .602              10
       SD    .070              .092

a1b2   Mean  .332                       .499     19
       SD    .046                       .150

a2b2   Mean  .342                       .516     9
       SD    .037                       .116

a1b3   Mean           .330     .535              17
       SD             .049     .109

a2b3   Mean           .332     .545              9
       SD             .064     .119

a1b4   Mean           .322              .493     20
       SD             .066              .166

a2b4   Mean           .370              .625     8
       SD             .064              .141

a.b.   Mean  .333     .333     .556     .518     113
       SD    .057     .061     .120     .153

TABLE 7-1b

Means and Standard Deviations -- Pretest
VAR(1) = Arithmetic Mean of Item Confidence Scores

             Fm A     Fm A     Fm B     Fm B     Both
             b1       b2       b3       b4       b.

a1     Mean  .312     .332     .330     .322     .324
       SD    .060     .046     .049     .066     .056
       N     21       19       17       20       77

a2     Mean  .373     .342     .332     .370     .354
       SD    .070     .037     .064     .064     .060
       N     10       9        9        8        36

a.     Mean  .332     .335     .331     .336     .333
       SD    .068     .043     .053     .068     .059
       N     31       28       26       28       113



TABLE 7-1c

Means and Standard Deviations -- Posttest
VAR(1) = Arithmetic Mean of Item Confidence Scores

             Fm A     Fm B     Fm A     Fm B     Both
             b1       b2       b3       b4       b.

a1     Mean  .555     .499     .535     .493     .521
       SD    .142     .150     .109     .166     .144
       N     21       19       17       20       77

a2     Mean  .602     .516     .545     .625     .571
       SD    .092     .116     .119     .141     .120
       N     10       9        9        8        36

a.     Mean  .570     .505     .538     .530     .537
       SD    .128     .138     .110     .168     .138
       N     31       28       26       28       113

TABLE 7-2a

Means and Standard Deviations -- Pretest and Posttest
VAR(2) = Geometric Mean of Item Confidence Scores

             Pretest           Posttest
Cell         Fm A     Fm B     Fm A     Fm B     N

a1b1   Mean  .266              .393              21
       SD    .040              .119

a2b1   Mean  .251              .375              10
       SD    .061              .070

a1b2   Mean  .248                       .374     19
       SD    .049                       .139

a2b2   Mean  .239                       .344     9
       SD    .039                       .102

a1b3   Mean           .236     .385              17
       SD             .034     .101

a2b3   Mean           .267     .362              9
       SD             .049     .133

a1b4   Mean           .247              .382     20
       SD             .045              .136

a2b4   Mean           .289              .478     8
       SD             .032              .123

a.b.   Mean  .253     .253     .383     .387     113
       SD    .047     .044     .107     .133

TABLE 7-2b

Means and Standard Deviations -- Pretest
VAR(2) = Geometric Mean of Item Confidence Scores

             Fm A     Fm A     Fm B     Fm B     Both
             b1       b2       b3       b4       b.

a1     Mean  .266     .248     .236     .247     .250
       SD    .040     .049     .034     .045     .043
       N     21       19       17       20       77

a2     Mean  .251     .239     .267     .289     .260
       SD    .061     .039     .049     .032     .049
       N     10       9        9        8        36

a.     Mean  .261     .245     .247     .259     .253
       SD    .048     .045     .042     .045     .045
       N     31       28       26       28       113




TABLE 7-2c

Means and Standard Deviations -- Posttest
VAR(2) = Geometric Mean of Item Confidence Scores

             Fm A     Fm B     Fm A     Fm B     Both
             b1       b2       b3       b4       b.

a1     Mean  .393     .374     .385     .382     .384
       SD    .119     .139     .101     .136     .123
       N     21       19       17       20       77

a2     Mean  .375     .344     .362     .478     .387
       SD    .070     .102     .133     .123     .115
       N     10       9        9        8        36

a.     Mean  .387     .364     .377     .410     .385
       SD    .105     .127     .111     .137     .120
       N     31       28       26       28       113

TABLE 7-3a

Means and Standard Deviations -- Pretest and Posttest
VAR(3) = Arithmetic Mean of Item Log Scores

             Pretest           Posttest
Cell         Fm A     Fm B     Fm A     Fm B     N

a1b1   Mean  70.936            78.687            21
       SD    3.727             7.124

a2b1   Mean  69.233            78.402            10
       SD    6.254             3.929

a1b2   Mean  69.263                     77.345   19
       SD    4.792                      7.504

a2b2   Mean  68.629                     75.892   9
       SD    3.626                      7.019

a1b3   Mean           68.430   78.578            17
       SD             3.184    5.650

a2b3   Mean           70.928   76.836            9
       SD             4.298    6.965

a1b4   Mean           69.315            77.953   20
       SD             4.137             7.154

a2b4   Mean           72.930            83.365   8
       SD             2.413             5.395

a.b.   Mean  69.757   69.841   78.312   78.189   113
       SD    4.543    3.892    6.090    7.213

TABLE 7-3b

Means and Standard Deviations -- Pretest
VAR(3) = Arithmetic Mean of Item Log Scores

             Fm A     Fm A     Fm B     Fm B     Both
             b1       b2       b3       b4       b.

a1     Mean  70.936   69.263   68.430   69.315   69.549
       SD    3.727    4.792    3.184    4.137    4.044
       N     21       19       17       20       77

a2     Mean  69.233   68.629   70.928   72.930   70.327
       SD    6.254    3.626    4.298    2.413    4.601
       N     10       9        9        8        36

a.     Mean  70.387   69.059   69.295   70.348   69.797
       SD    4.653    4.393    3.724    4.040    4.226
       N     31       28       26       28       113








TABLE 7-3c

Means and Standard Deviations -- Posttest
VAR(3) = Arithmetic Mean of Item Log Scores

             Fm A     Fm B     Fm A     Fm B     Both
             b1       b2       b3       b4       b.

a1     Mean  78.687   77.345   78.578   77.953   78.141
       SD    7.124    7.504    5.650    7.154    6.820
       N     21       19       17       20       77

a2     Mean  78.402   75.892   76.836   83.365   78.486
       SD    3.929    7.019    6.965    5.395    6.326
       N     10       9        9        8        36

a.     Mean  78.595   76.878   77.975   79.500   78.251
       SD    [illegible]  7.254  6.055  7.054    6.640
       N     31       28       26       28       113

TABLE 7-4a

Means and Standard Deviations -- Pretest and Posttest
VAR(4) = Arithmetic Mean of Item Elimination Scores

             Pretest           Posttest
Cell         Fm A     Fm B     Fm A     Fm B     N

a1b1   Mean  .150              .552              21
       SD    .124              .200

a2b1   Mean  .269              .624              10
       SD    .112              .098

a1b2   Mean  .194                       .444     19
       SD    .102                       .213

a2b2   Mean  .212                       .458     9
       SD    .079                       .172

a1b3   Mean           .186     .563              17
       SD             .091     .158

a2b3   Mean           .196     .526              9
       SD             .132     .199

a1b4   Mean           .171              .525     20
       SD             .106              .200

a2b4   Mean           .243              .648     8
       SD             .105              .138

a.b.   Mean  .194     .191     .564     .504     113
       SD    .114     .106     .172     .201

TABLE 7-4b

Means and Standard Deviations -- Pretest
VAR(4) = Arithmetic Mean of Item Elimination Scores

             Fm A     Fm A     Fm B     Fm B     Both
             b1       b2       b3       b4       b.

a1     Mean  .150     .194     .186     .171     .174
       SD    .124     .102     .091     .106     .106
       N     21       19       17       20       77

a2     Mean  .269     .212     .196     .243     .231
       SD    .112     .079     .132     .105     .108
       N     10       9        9        8        36

a.     Mean  .189     .200     .189     .192     .192
       SD    .131     .094     .104     .109     .110
       N     31       28       26       28       113



TABLE 7-4c

Means and Standard Deviations -- Posttest
VAR(4) = Arithmetic Mean of Item Elimination Scores

             Fm A     Fm B     Fm A     Fm B     Both
             b1       b2       b3       b4       b.

a1     Mean  .552     .444     .563     .525     .521
       SD    .200     .213     .158     .200     .197
       N     21       19       17       20       77

a2     Mean  .624     .458     .526     .648     .563
       SD    .098     .172     .199     .138     .168
       N     10       9        9        8        36

a.     Mean  .575     .448     .550     .560     .534
       SD    .175     .198     .170     .190     .188
       N     31       28       26       28       113

TABLE 7-5a

Means and Standard Deviations -- Pretest and Posttest
VAR(5) = Arithmetic Mean of Item Pseudo-Classical Scores

             Pretest           Posttest
Cell         Fm A     Fm B     Fm A     Fm B     N

a1b1   Mean  .360              .658              21
       SD    .086              .132

a2b1   Mean  .424              .700              10
       SD    .081              .069

a1b2   Mean  .378                       .569     19
       SD    .071                       .156

a2b2   Mean  .403                       .573     9
       SD    .046                       .108

a1b3   Mean           .367     .666              17
       SD             .071     .115

a2b3   Mean           .386     .634              9
       SD             .096     .143

a1b4   Mean           .357              .626     20
       SD             .078              .150

a2b4   Mean           .427              .737     8
       SD             .091              .097

a.b.   Mean  .383     .376     .664     .614     113
       SD    .077     .082     .118     .148

TABLE 7-5b

Means and Standard Deviations -- Pretest
VAR(5) = Arithmetic Mean of Item Pseudo-Classical Scores

             Fm A     Fm A     Fm B     Fm B     Both
             b1       b2       b3       b4       b.

a1     Mean  .360     .378     .367     .357     .365
       SD    .086     .071     .071     .078     .076
       N     21       19       17       20       77

a2     Mean  .424     .403     .386     .427     .410
       SD    .081     .046     .096     .091     .079
       N     10       9        9        8        36

a.     Mean  .381     .386     .374     .377     .380
       SD    .088     .064     .079     .086     .079
       N     31       28       26       28       113



TABLE 7-5c

Means and Standard Deviations -- Posttest
VAR(5) = Arithmetic Mean of Item Pseudo-Classical Scores

             Fm A     Fm B     Fm A     Fm B     Both
             b1       b2       b3       b4       b.

a1     Mean  .658     .569     .666     .626     .630
       SD    .132     .156     .115     .150     .142
       N     21       19       17       20       77

a2     Mean  .700     .573     .634     .737     .660
       SD    .069     .108     .143     .097     .120
       N     10       9        9        8        36

a.     Mean  .672     .570     .655     .658     .639
       SD    .116     .140     .123     .144     .136
       N     31       28       26       28       113

TABLE 7-6a

Means and Standard Deviations -- Pretest and Posttest
VAR(6) = Arithmetic Mean of Classical Scores

             Pretest           Posttest
Cell         Fm A     Fm B     Fm A     Fm B     N

a1b1   Mean  .404              .691              21
       SD    .089              .118

a2b1   Mean  .464              .700              10
       SD    .076              .063

a1b2   Mean  .444                       .634     19
       SD    .080                       .146

a2b2   Mean  .418                       .578     9
       SD    .098                       .104

a1b3   Mean           .419     .678              17
       SD             .108     .122

a2b3   Mean           .449     .662              9
       SD             .115     .122

a1b4   Mean           .414              .644     20
       SD             .090              .135

a2b4   Mean           .430              .760     8
       SD             .102              .117

a.b.   Mean  .429     .424     .685     .646     113
       SD    .087     .100     .110     .139

TABLE 7-6b

Means and Standard Deviations -- Pretest
VAR(6) = Arithmetic Mean of Classical Scores

             Fm A     Fm A     Fm B     Fm B     Both
             b1       b2       b3       b4       b.

a1     Mean  .404     .444     .419     .414     .420
       SD    .089     .080     .108     .090     .091
       N     21       19       17       20       77

a2     Mean  .464     .418     .449     .430     .441
       SD    .076     .098     .115     .102     .095
       N     10       9        9        8        36

a.     Mean  .423     .436     .429     .419     .427
       SD    .089     .085     .109     .092     .093
       N     31       28       26       28       113






TABLE 7-6c

Means and Standard Deviations -- Posttest
VAR(6) = Arithmetic Mean of Classical Scores

             Fm A     Fm B     Fm A     Fm B     Both
             b1       b2       b3       b4       b.

a1     Mean  .691     .634     .678     .644     .662
       SD    .118     .146     .122     .135     .131
       N     21       19       17       20       77

a2     Mean  .700     .578     .662     .760     .673
       SD    .063     .104     .122     .117     .118
       N     10       9        9        8        36

a.     Mean  .694     .616     .672     .677     .666
       SD    .103     .135     .120     .139     .126
       N     31       28       26       28       113

TABLE 7-7

Test Reliabilities for Six Different Types of Scores

             Pretest A (N=59)             Pretest B (N=54)
           Mean      SD       r         Mean      SD       r

VAR(1)     .333     .057    .547        .333     .061    .599
VAR(2)     .253     .047    .324        .253     .044    ****
VAR(3)   69.757    4.543    .318      69.841    3.892    .049
VAR(4)     .194     .114    .430        .191     .106    .229
VAR(5)     .383     .077    .371        .376     .082    .395
VAR(6)     .429     .087    .041        .424     .100    .130

             Posttest A (N=57)            Posttest B (N=56)
           Mean      SD       r         Mean      SD       r

VAR(1)     .556     .120    .799        .518     .153    .892
VAR(2)     .383     .107    .359        .387     .133    .733
VAR(3)   78.312    6.090    .442      78.189    7.213    .677
VAR(4)     .564     .172    .677        .504     .201    .756
VAR(5)     .664     .118    .631        .614     .148    .757
VAR(6)     .685     .110    .474        .646     .139    .646



Note. — All reliability coefficients, except those for 
VAR(2), were calculated using Hoyt's analysis of variance 
technique. When a subject's score is the geometric mean 
of the subjective probabilities associated with the correct 
answers to items [VAR(2)] , one cannot employ Hoyt's 
technique for calculating reliability; therefore, for VAR(2), 
we report odd-even split-halves coefficients. 

**** indicates that the coefficient could not be
calculated.
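
For readers who wish to reproduce coefficients of the kind reported above, the following is a minimal sketch of Hoyt's analysis of variance coefficient, written as 1 minus the ratio of the residual mean square to the between-subjects mean square in a subjects-by-items layout. The function name and the small score matrix are illustrative assumptions; this is not the DEC-TEST code.

```python
def hoyt_reliability(scores):
    """Hoyt's coefficient from a subjects-by-items score matrix,
    where scores[i][j] is the score of subject i on item j."""
    n = len(scores)        # number of subjects
    k = len(scores[0])     # number of items
    grand = sum(sum(row) for row in scores) / (n * k)
    person_means = [sum(row) / k for row in scores]
    item_means = [sum(row[j] for row in scores) / n for j in range(k)]

    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_persons = k * sum((m - grand) ** 2 for m in person_means)
    ss_items = n * sum((m - grand) ** 2 for m in item_means)
    ss_resid = ss_total - ss_persons - ss_items

    ms_persons = ss_persons / (n - 1)
    ms_resid = ss_resid / ((n - 1) * (k - 1))
    return 1 - ms_resid / ms_persons

# Hypothetical data: four subjects, three items, scored correct/wrong.
example = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]]
r = hoyt_reliability(example)
print(round(r, 3))
```

The same function accepts item scores on any numeric scale, so it applies equally to pseudo-classical item scores between 0 and 1.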

Note. — One can calculate Livingston's reliability 
coefficient for any criterion score or cut-off value 
using the means, standard deviations, and reliability 
coefficients reported above. 
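
The calculation mentioned in the preceding note can be sketched as follows, using the usual form of Livingston's coefficient, in which the squared distance between the mean and the criterion score is added to both the true-score and observed-score variance. The cut-off of 0.80 is a hypothetical value chosen for illustration; the mean, standard deviation, and reliability are the Posttest A values for VAR(5) in Table 7-7.

```python
def livingston_k2(mean, sd, reliability, cutoff):
    """Livingston's criterion-referenced reliability coefficient:
    (r * var + (mean - C)^2) / (var + (mean - C)^2)."""
    var = sd ** 2
    shift = (mean - cutoff) ** 2
    return (reliability * var + shift) / (var + shift)

# Posttest A, VAR(5), from Table 7-7; the cut-off is an assumed example value.
k2 = livingston_k2(mean=0.664, sd=0.118, reliability=0.631, cutoff=0.80)
print(round(k2, 3))
```

When the cut-off equals the test mean, the coefficient reduces to the classical reliability; the further the cut-off lies from the mean, the closer the coefficient moves toward 1.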

CHAPTER VIII



Summary and Suggestions 



Summary 

Since this report is quite long (much longer, in 
fact, than the author had intended) and, in many cases, 
quite detailed, it seems advisable to provide the reader
with a brief summary of each of the chapters. 

Chapter I. The major purposes of this chapter
are to provide a context within which this report fits,
and to introduce the reader to distinctions in terminology.
We indicate, for example, that distinctions can
be made between criterion-referenced testing and mastery 
testing, in that mastery testing can be viewed as a 
specific kind of criterion-referenced testing. However, 
this distinction is not maintained very well in the 
literature; thus, in order to avoid confusion with 
previous literature, we have, in general, reserved the 
term "mastery" for those issues, statistics, etc., that
have previously carried the label "mastery." 

Chapter II. The major purpose of this chapter is
to examine the relevance of classical test theory to
criterion-referenced and mastery testing. We find that
the classical test theory assumptions are general enough
to form a basis for criterion-referenced and mastery
testing; however, we question whether or not these
assumptions are sufficient. Furthermore, we find that
the binomial error model is more likely to be appropriate
for most criterion-referenced tests than is the normal
error model.

Chapter III. In this chapter, which is primarily
a review of the literature, we consider the concept of 
validity with respect to criterion-referenced and mastery 
testing. We find that content validity is of paramount 
concern for criterion-referenced and mastery testing, 
since there is seldom available any extra-test criterion
measure. Consequently, the validity of a criterion-
referenced or mastery test is, from a practical point of
view, very clearly tied to the procedure whereby the
test is developed. We find that the "item forms" procedure
is highly desirable in that this procedure
guarantees a certain degree of "objective-item congruence."

Chapter IV. In this chapter, which is primarily
a review of the literature, we consider the concept of
reliability with respect to criterion-referenced and
mastery testing. Reliability issues are probably the
most frequently discussed quantitative issues surrounding
criterion-referenced and mastery testing. We report,
criticize, and compare each of the major reliability
indices that have been proposed in the literature, and
we find that there is considerable disagreement (or,
perhaps, confusion) among researchers with respect to
reliability issues. In particular, researchers have
often failed to distinguish between (a) reliability
indices for criterion-referenced and mastery tests,
(b) reliability in the sense of stability, equivalence,
or internal consistency, and (c) reliability for measures
of state and measures of change. There is also some
evidence for confusion between indices for test reliability
and indices for instructional effectiveness.

Chapter V. In general, the first four chapters treat 
criterion-referenced and mastery testing without directly 
considering issues that are specific to an analysis of 
individual items. Chapters V, VI, and VII, on the other 
hand, are primarily concerned with the analysis of 
criterion-referenced and mastery items, per se. In 
Chapter V we discuss statistics that have been suggested 
for analyzing such items, and we present a procedure for 
identifying items that may require revision. The proce- 
dure discussed necessitates calculating a set of statistics 
for each item and defining a set of rules to specify how 
to employ the item statistics in order to identify items 
that appear to require revision. 

Chapter VI. For the most part, Chapter V involves
the explicit assumption that items are scored in the 
classical correct/wrong manner. In Chapter VI we consider 
alternatives to the classical procedure. Specifically, 
we consider elimination scoring and various scoring 
procedures that entail the collection of subjective probabilities
from each student for each alternative of an
item. We find that elimination scoring is of questionable
value in the analysis of criterion-referenced and mastery 
items, but scoring procedures that entail subjective 
probabilities appear to have promise. In particular, 
we define and examine a new kind of score called a "pseudo- 
classical score" which appears to be quite useful as a 
basis for examining the reliability and validity of 
criterion-referenced and mastery items. 

Chapter VII. In this chapter we present a statistical
analysis of a set of item data which we use to
illustrate many of the statistics and procedures 
discussed in Chapters V and VI. 

Appendix A. In this appendix we present the manual 
for DEC-TEST, a Fortran IV computer program written by 
the author. DEC-TEST uses subjective probabilities in 
order to calculate a number of student scores over items 
(typically associated with confidence testing or admissible 
probability measurement) and a number of item scores 
(including confidence, elimination, and pseudo-classical 
scores). Also, DEC-TEST has an extensive capability for
item analysis. The manual in Appendix A provides an
extensive guide to the use of DEC-TEST, a detailed
explanation of all outputs, scores, and statistics, and 
an introduction to the use of subjective probabilities in 
testing. 



Suggestions for the Researcher 

It is probably safe to say that there are no 
definitive answers to any issue in criterion-referenced 
or mastery testing; thus, in a sense, every issue is a 
potential topic for research. However, I would like to 
identify a few issues which I feel are critical or often 
overlooked: 

(a) We need better statistical and non-statistical 
models for considering criterion-referenced and mastery 
testing — models in which assumptions and criteria are 
stated clearly and unambiguously. For example, I believe
that we need a test-theoretic model for criterion-refer-
enced testing that incorporates both random error and
systematic error and that employs a definition of true 
score which is different from the classical definition. 

(b) We need more integrated theoretical and practical 
work concerning the reliability and validity of criterion- 
referenced and mastery tests. 

(c) We need alternative procedures for item construc- 
tion. In particular, we need a better capability of 
constructing item forms for disciplines other than 
mathematics and the physical sciences. 

(d) We need much more consideration of alternative 
procedures for scoring items and defining criterion 
performance. At the present time, almost exclusively, 
items are scored in a correct/wrong (1,0) manner and
criterion performance is defined in terms of number of 
items correct. This implies that we are only concerned 
about whether or not a student can recognize or recall 
a correct answer, and we are not concerned about things 
like the degree of certainty that a student associates
with his or her response. At any rate, it is difficult
to believe that the classical correct/wrong procedure
is the best, or the only appropriate, method for
scoring items and defining criterion performance.

(e) We need more consideration of issues surrounding
the identification of inadequate criterion-referenced
and mastery test items and procedures for revising such
items.



Suggestions for the Practitioner 

Chapter II, Chapter VI, parts of Chapter VII, and
Appendix A are probably of marginal concern at the present 
time for most practitioners. However, the author feels 
that most practitioners should be familiar with the 
issues treated in the remaining parts of this report. 
In particular, attention should be given to Chapters 
I, III, and V. Also, the bibliography provided on the
next few pages should be especially useful to most 
practitioners. The issue of reliability treated in 
Chapter IV is exceedingly important; however, it is 
unfortunately true that there is no generally accepted
procedure for calculating the reliability of a criterion- 
referenced or mastery test. Thus, the author suggests 
that practitioners study Chapter IV but be very cautious 
in using or interpreting any single index of reliability. 

Methods of Administering and Scoring a Test

Classical Testing. The classical method of admin-
istering and scoring a test item necessitates that a student
indicate which alternative he or she believes is correct.
If the student picks the correct alternative, then the
student receives one point; otherwise, the student receives
zero points. This very simple procedure forms the basis
for much of classical test theory, and this procedure is
quite useful for many purposes. However, this procedure
clearly does not provide differential information about the
relative attractiveness of each alternative for the student.
One way to approximate such information is through elimina-
tion scoring; one way to actually accumulate such informa-
tion is through decision-theoretic testing.

Elimination Testing. In elimination testing, the
student indicates which alternatives he or she believes
are incorrect. The student gets the highest possible item
score (usually 1.0) when he or she eliminates all alter-
natives except the correct answer; the student gets the
lowest possible item score (usually -1.0) when he or she
eliminates only the correct answer. If neither one of these
two extreme conditions prevails, then the student gets an
intermediate score that is determined according to a specific 
scoring rule. Thus, elimination scoring provides some 
information about the relative attractiveness of each alter- 
native; but, for example, if a student eliminates two alter- 
natives, we do not know whether or not the student feels 
more uncertain about one alternative than about the other. 

Decision-Theoretic Testing.¹ In decision-theoretic
testing a student responds to a test item by providing
reported (observed) probabilities for each alternative for
the item, such that the reported (observed) probabilities
sum to unity. Although there are a number of ways an item
can be scored in a decision-theoretic testing framework, the
scoring system employed by DEC-TEST is the logarithmic
scoring system. This system, as described in detail by
Shuford et al. (1966), has a number of useful properties.
One such property is called the "reproducing" property,
which implies that a student will maximize his or her
expected score if and only if the student's reported
(observed) probabilities are identical to his or her
degree-of-belief (true) probabilities. For example, if
a student's degree-of-belief (true) probabilities for a
three-alternative item are 0.50, 0.25, and 0.25, respec-
tively, then the student will maximize his or her
expected score only if he or she responds with reported
(observed) probabilities of 0.50, 0.25, and 0.25, respec-
tively.

¹ What we refer to as "decision-theoretic testing" has
been called, among other things, "confidence testing,"
"valid confidence testing," and "admissible probability
measurement." Echternacht (1972) and Savage (1971) provide
reviews of relevant literature concerning this topic.
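
The reproducing property can be checked numerically for the three-alternative example. The following is only a sketch in Python (DEC-TEST itself is a FORTRAN IV program); the parameter values A = 50 and B = 100 are borrowed from the illustration in Section I, and the 0.05 grid of candidate responses is an arbitrary choice made here. No truncation is needed because every candidate probability exceeds 0.01.

```python
import math

A, B = 50.0, 100.0  # illustrative log scoring parameters (assumed, see Section I)

def expected_log_score(true_probs, reported_probs):
    """Expected log score when alternative j is correct with probability true_probs[j]."""
    return sum(p * (A * math.log10(r) + B)
               for p, r in zip(true_probs, reported_probs) if p > 0)

true_probs = (0.50, 0.25, 0.25)

# Every response on a 0.05 grid whose three probabilities are positive and sum to unity.
steps = 20
candidates = [
    (i / steps, j / steps, (steps - i - j) / steps)
    for i in range(1, steps)
    for j in range(1, steps - i)
]

# The response with the largest expected score should be the true probabilities.
best_report = max(candidates, key=lambda r: expected_log_score(true_probs, r))
```

Any other response on the grid, including the more extreme (0.90, 0.05, 0.05), yields a lower expected score than the honest response.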

According to Savage (1971): 

Proper scoring rules hold forth promise as more 
sophisticated ways of administering multiple-choice 
tests in certain educational situations. The student 
is invited not merely to choose one [answer] (or 
possibly none) but to show in some way how his opinion 
is distributed over the [answers], subject to a proper 
scoring rule or a rough facsimile thereof. 

Though requiring more student time per item, these
methods should result in more discrimination per item
than ordinary multiple-choice tests, with a possible
net gain. Also, they seem to open a wealth of oppor-
tunities for the educational experimenter.

Reasons for Programming DEC-TEST 

One of the principal reasons why the author undertook 
to program DEC-TEST was to examine each of the three 
scoring systems discussed above, especially with respect 
to their differential usefulness for item analysis in both 
norm-referenced and criterion-referenced situations. In
order to do this DEC-TEST accepts decision-theoretic test 
data and estimates how a student would respond under 
elimination testing and classical testing rules. DEC-TEST 
can then perform an item analysis for each item for each 
type of testing procedure. 

Other reasons that motivated the author to program
DEC-TEST include: (a) a desire to provide the capability
of obtaining a detailed analysis of student and item
performance under decision-theoretic testing, (b) a desire
to provide the capability of comparing estimates of relia-
bility for the three types of testing procedures discussed
above, and (c) a desire to provide the capability of
examining a number of different issues concerning the
use of decision-theoretic testing as a tool for
measurement and evaluation.

Features of DEC-TEST 

DEC-TEST is a computer program for Decision-Theoretic
Testing and the Analysis of Item Data. DEC-TEST was 
programmed using the FORTRAN IV, Level G, compiler on the 
IBM-370/155 computer at the State University of New York 
at Stony Brook. 

DEC-TEST can accept any one of five different kinds of
input, and it can produce as many as forty-four different
outputs. Both students and items can be identified alpha- 
numerically, student identifications can be sorted, missing 
data features are available, items can be weighted, and 
items can have 2-5 alternatives. Included in the different 
kinds of possible output are: (a) listings of control cards, 
input data, and observed probabilities, (b) 102 variables 
for each student (calculated, printed and/or punched), 
(c) sophisticated item analysis routines for decision- 
theoretic, elimination, and classical testing, (d) eight 
different rosters of student item scores (calculated, printed, 
and/or punched), and (e) eight different kinds of reliability 
analyses plus a summary of all reliability analyses.

We caution the user of DEC-TEST in that many of the 
scores calculated and outputs provided are of very recent 
origin and require further study before their usefulness
and/or validity will have been demonstrated. 

Using this Manual

This manual is not intended to provide a completely 
detailed description of decision-theoretic testing. Many
statements are made without an associated proof, and many
parts of this manual assume some familiarity with decision-
theoretic testing, classical test theory, statistics, and/or
intermediate algebra. This manual is intended to be 
technically accurate, but technical accuracy sometimes 
militates against simple explanations. 

For the most part, knowledge of FORTRAN IV is not 
required for running DEC-TEST. An exception to this 
general rule occurs in the definition of object-time format
statements (see Sections II and III). Also, some know- 
ledge of the IBM Job Control Language (JCL) is required 
(see Section VI).

Terminology and notation with regard to decision-
theoretic testing, at the present time, have not been
standardized. For example, "true confidences" or "true 
probabilities," as used in this manual, have been called 
elsewhere "degree-of-belief probabilities," "state proba- 
bilities," "internalized probabilities," and "personal 
probabilities"; "observed probabilities" have been called 
elsewhere "reported probabilities" and "assigned proba- 
bilities." Whether or not the terminology and notation 
used here represents the "best" choice is open to question; 
however, it is the author's intention that the terminology 
and notation used in this manual be consistent. One slight
inconsistency known to the author is that the word "student"
is used interchangeably with "subject."

Sample Input and Output

Sample input and output can be obtained from the 
author by writing to him at the following address: 

Dr. Robert L. Brennan 
Department of Education
SUNY at Stony Brook 
Stony Brook, New York 11790 

The author will also provide a source deck upon request, 
at a fee to cover cost of punching deck, handling and 
shipping.

I. Introduction to DEC-TEST and
Decision-Theoretic Testing

In the following paragraphs of this section, we 
provide an introduction to the subject of decision- 
theoretic testing, which allows us to establish a nota- 
tional scheme for subsequent sections. Also, we intro- 
duce the user to fundamental student test scores reported 
by DEC-TEST. Finally, we discuss different kinds of item
scores based upon decision-theoretic scoring, elimination
scoring, and classical scoring. 

Section V of this manual may be considered as a 
continuation of Section I, in that Section V provides a 
discussion of, and formulas for, all of the 102 
Individual Subject Scores reported by DEC-TEST. Thus,
some users may find it beneficial to read Section V 
immediately after Section I. 

Logarithmic Scoring System used by DEC-TEST 

In decision-theoretic testing, the student assesses
the "confidence" he or she has in the correctness of each
of the alternatives for each item and expresses this
"confidence" (directly or indirectly) in terms of proba-
bilities. Let

      $P_{hij}$ = observed probability for
                  student h (h = 1, 2, ..., N), for
                  item i (i = 1, 2, ..., K), for
                  alternative j (j = 1, 2, ..., $n_i$), where

      N         = number of students who took test,
      K         = number of items on test, and
      $n_i$     = number of alternatives for item i
                  (2 <= $n_i$ <= 5).

Note that, in DEC-TEST, the number of alternatives for an
item must be 2, 3, 4, or 5. Now, it can be shown that if
a linear scoring system (e.g., sum of probabilities
associated with correct answer or a linear function of
this sum) were used, then it would be in a student's best
interest to use probabilities of 1.0 and 0.0, only,
regardless of the student's "true confidence" in each of
the alternatives. By "best interest" we mean that the
student would, in the long run, maximize his or her score,
given his or her actual knowledge of the answers to the
test items. Thus, a linear scoring system would motivate a
student to guess.
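
The incentive created by a linear scoring system can be illustrated with a small sketch (Python, written for this illustration only). It assumes the simplest linear rule, in which the item score equals the probability assigned to the correct answer, and reuses true probabilities of 0.50, 0.25, and 0.25 as an example.

```python
def expected_linear_score(true_probs, reported_probs):
    """Expected item score when the score is the probability given to the correct answer."""
    return sum(p * r for p, r in zip(true_probs, reported_probs))

true_probs = (0.50, 0.25, 0.25)

# Responding honestly versus staking everything on the most likely alternative.
honest = expected_linear_score(true_probs, true_probs)
all_or_nothing = expected_linear_score(true_probs, (1.0, 0.0, 0.0))
```

Under these assumptions the honest response has an expected score of 0.375, whereas putting 1.0 on the most likely alternative has an expected score of 0.50, so the student is rewarded for overstating his or her confidence.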

In decision-theoretic testing we seek (as one of our
goals) the elimination of guessing by defining a scoring
system such that it is in the student's best interest to
respond with probabilities that are isomorphic with the
student's true confidence in each of the alternatives to
every item. Shuford et al. (1966) have shown that, to
fulfill these requirements, one can use a logarithmic
scoring function defined as:

(1)   $L_{hi} = A_i \log P_{hi*} + B_i$ , where

      $L_{hi}$     = log score for student h on item i,
      log          = $\log_{10}$,
      $A_i$, $B_i$ = parameters for log scoring function
                     (discussed below and in Section III,
                     Third Input Card), and
      $P_{hi*}$    = observed probability associated with
                     correct answer (j = *) on item i for
                     student h.

Actually, $L_{hi}$ has a lower limit equal to $-\infty$ when
$P_{hi*} = 0.0$. Therefore, we truncate the function at a
convenient point $C_i$, $0.0 < C_i < 1.0$, such that the lowest
possible value for $L_{hi}$ is

      $L_{hi} = A_i \log C_i + B_i$ .

Now, $B_i$ is actually the highest possible value of $L_{hi}$,
so the range of $L_{hi}$ is

      $-A_i \log C_i$ .

DEC-TEST allows the user to specify as many different
sets of values for $A_i$, $B_i$, and $C_i$ as there are different
item types. An item type is defined as the number of alter-
natives an item has. Thus, for example, if a test is
composed of two- and three-alternative items, then the number
of different item types is two, and two (possibly different)
sets of values for $A_i$, $B_i$, and $C_i$ may be specified.
However, in this section, for illustrative purposes, we
will use $A_i = 50$, $B_i = 100$, and $C_i = 0.01$ for all items, i,
regardless of the number of alternatives. The scoring
function we will use is, thus,

      $L_{hi} = 50 \log P_{hi*} + 100$ ,

with a truncation value (lowest acceptable value of $P_{hi*}$)
of 0.01, and a range of 100 "points" for each item.
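
As a check on the illustrative scoring function, the truncated log score can be sketched as follows. This is a Python illustration, not DEC-TEST code; the function name and argument names are ours, and the defaults are the illustrative values of 50, 100, and 0.01.

```python
import math

def log_score(p_correct, a=50.0, b=100.0, c=0.01):
    """Truncated logarithmic item score: a * log10(max(p_correct, c)) + b."""
    return a * math.log10(max(p_correct, c)) + b

highest = log_score(1.0)   # probability 1.0 on the correct answer
lowest = log_score(0.0)    # truncated at c before taking the logarithm
```

With these parameter values the score runs from 0 (for any probability at or below the truncation value) to 100, which is the range of 100 "points" noted above.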

Consider the item parameters and observed probabilities
for student h in Table A-1. Note that $w_i$ is the weight for
item i, and recall that * indicates the correct alternative.
The log scores, $L_{hi}$, for each of the items are given in
Table A-2. The weighted sum of these scores is

(2)   $L_{h+} = \sum_{i=1}^{K} w_i L_{hi}$
              = (1)(100) + (1)(35) + ... + (2)(95)
              = 810,

where "+" indicates "sum." The weighted average of the
log scores is

(3)   $L_{h.} = L_{h+} \Big/ \sum_{i=1}^{K} w_i$
              = 810/10
              = 81,

where "." indicates "average."

Now, it can also be shown that:

(4)   $L_{h.} = A_i \log \{ [ \prod_{i=1}^{K} P_{hi*}\,\mathrm{EXP}(w_i) ]\,\mathrm{EXP}( 1 / \sum_{i=1}^{K} w_i ) \} + B_i$ ,

where "EXP" means "exponential" (the quantity to its left is
raised to the power in parentheses); i.e., $L_{h.}$ is the log
score that results when the geometric mean of the $P_{hi*}$
(terms within braces in (4), above) replaces $P_{hi*}$ in
(1), above. For our illustrative data, the
geometric mean is:

      $[(1.00)^1 (0.05)^1 \cdots (0.40)^2 (0.80)^2]$ EXP$[1/(1 + 1 + \cdots + 2 + 2)]$
         = (0.00013824) EXP(0.1)
         = 0.411 .

$L_{h.}$ and the geometric mean (sometimes transformed
linearly) are probably the two most common scores for
decision-theoretic testing.

TABLE A-1
Illustrative Data:
Observed and Adjusted Probabilities

Item Parameters       Obs. Probs.                         Adj. Probs.
i   $n_i$  $w_i$  *   $P_{hi1}$   $P_{hi2}$   $P_{hi3}$   $\hat{P}_{hi1}$   $\hat{P}_{hi2}$   $\hat{P}_{hi3}$
1   2      1      1   1.00        0.01                    0.79              0.14
2   2      1      1   0.05        0.95                    0.17              0.76
3   2      1      1   0.60        0.40                    0.53              0.40
4   2      1      1   0.45        0.55                    0.43              0.50
5   2      1      1   0.50        0.50                    0.47              0.47
6   3      1      1   0.20        0.40        0.40        0.27              0.40              0.40
7   3      2      1   0.40        0.30        0.30        0.40              0.33              0.33
8   3      2      1   0.80        0.20        0.01        0.66              0.27              0.14
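
The equivalence stated in (4) can be verified for the illustrative data with a short sketch. It is written in Python for illustration only and assumes $A_i = 50$ and $B_i = 100$ for every item; the list pairs each item's probability for the correct answer with its weight, as read from Table A-1. Unrounded item log scores are used, so the intermediate values differ slightly from the rounded figures in the text.

```python
import math

A, B = 50.0, 100.0  # illustrative parameters, assumed equal for all items

# (probability assigned to the correct answer, item weight) for items 1 through 8
items = [(1.00, 1), (0.05, 1), (0.60, 1), (0.45, 1),
         (0.50, 1), (0.20, 1), (0.40, 2), (0.80, 2)]

total_weight = sum(w for _, w in items)
weighted_product = math.prod(p ** w for p, w in items)
geometric_mean = weighted_product ** (1.0 / total_weight)

# Weighted average of the item log scores, as in (3).
average_log_score = sum(w * (A * math.log10(p) + B) for p, w in items) / total_weight

# Log score of the geometric mean, as in (4).
log_score_of_mean = A * math.log10(geometric_mean) + B
```

Both routes give the same average log score, which rounds to the value of 81 reported above, and the geometric mean agrees with 0.411 to three decimal places.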

Clarification. In the foregoing discussion of the
logarithmic scoring system, we assumed that a student
responded with, what we call here, "observed probabilities."
Actually, DEC-TEST allows the user to employ any one of
five different kinds of input, which are first converted
to "original probabilities" and then converted to
observed probabilities, $P_{hij}$. These conversion proce-
dures are discussed later in detail. Here we merely note
that the task of converting input to original probabilities
involves straightforward transformations, whereas the task
of converting original to observed probabilities is one of
resolving inconsistencies. A typical inconsistency results
when the sum of the original probabilities does not equal
unity. In this case, if the discrepancy is big enough
(as defined by the user via the DCT parameter; see
Section III), the user can perform either one of two dif-
ferent types of normalization procedures (see NORM para-
meter in Section III) in order to produce the observed
probabilities. The careful reader probably noted that
for items numbered 1 and 8 in the illustrative data, the
sum of the observed probabilities does not equal unity;
however, the observed probabilities reported in Table A-1
are legitimate results given one type of normalization
procedure available to the user of DEC-TEST.

Realism Line and Adjusted Probabilities

One obvious question when using decision-theoretic
testing is, "To what extent is a student being realistic
in the assignment of his or her probabilities?" If a
student is totally realistic, then, for each of the pro-
babilities he or she uses, the proportion of times each
probability is correct will equal the probability itself.
Graphically, as indicated by the solid line in Figure A-1,
this implies that the Ideal Line (meaning ideal realism)
has a slope of 1.0 and an intercept of 0.0. To the extent
that this is not true, then the student is unrealistic,
to some degree.



TABLE A-3
Illustrative Data:
Proportion of Times that Distinct
Observed Probability Values are Correct

Observed       Weighted Number of Times     Proportion of
Probability    Observed Probability is:     Times $P_{hij}$
               Used         Correct         Is Correct

1.00            1            1              1.00
0.95            1            0              0.00
0.80            2            2              1.00
0.60            1            1              1.00
0.55            1            0              0.00
0.50            2            1              0.50
0.45            1            1              1.00
0.40            5            2              0.40
0.30            4            0              0.00
0.20            3            1              0.33
0.05            1            1              1.00
0.01            3            0              0.00

Totals         25           10

FIGURE A-1
Illustrative Data:
Ideal Line and Realism Line

Viewing our observed probabilities as indicated in
Table A-3, we can plot the points in Figure A-1. Now,
using least squares analysis to obtain a best-fitting
straight line for these points, we obtain the Realism
Line in Figure A-1, which has a slope of 0.656 and an
intercept of 0.138. (Formulas for the slope and inter-
cept are provided in Section V.) Clearly, student h is
somewhat unrealistic; in fact, student h is somewhat over-
confident. (Some thought will convince the user that
whenever the slope of the Realism Line is less than 1.0,
the student is "over-confident." See Section V for other
indicators of over- and under-confidence.)
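
The fit can be approximated from Table A-3 with the sketch below. The formulas actually used by DEC-TEST are those of Section V, which are not reproduced here; this Python illustration merely assumes an ordinary least squares fit in which every weighted use of a probability counts as one observation, scored 1 if the alternative was correct and 0 otherwise.

```python
# (observed probability, weighted times used, weighted times correct), from Table A-3
rows = [(1.00, 1, 1), (0.95, 1, 0), (0.80, 2, 2), (0.60, 1, 1),
        (0.55, 1, 0), (0.50, 2, 1), (0.45, 1, 1), (0.40, 5, 2),
        (0.30, 4, 0), (0.20, 3, 1), (0.05, 1, 1), (0.01, 3, 0)]

n_used = sum(u for _, u, _ in rows)           # 25 weighted uses in all
n_correct = sum(c for _, _, c in rows)        # 10 of them correct
sum_x = sum(u * x for x, u, _ in rows)        # sum of probabilities used
sum_xx = sum(u * x * x for x, u, _ in rows)   # sum of squared probabilities
sum_xy = sum(c * x for x, _, c in rows)       # sum of probabilities that were correct

# Least squares slope and intercept of correctness on probability.
slope = (sum_xy - sum_x * n_correct / n_used) / (sum_xx - sum_x ** 2 / n_used)
intercept = (n_correct - slope * sum_x) / n_used
```

With the table values as printed, this sketch gives a slope near 0.66 and an intercept near 0.14, close to the reported 0.656 and 0.138; small differences are to be expected if Section V treats the truncated 0.01 entries differently.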

Now, if the student had been more realistic, the
student's observed probabilities would have been less
extreme. For example, from the equation for the Realism
Line, we note that, 0.138 + 0.656(0.80) = 0.66, which
can be interpreted as meaning that the student would have
been more realistic if he or she used 0.66 in place of
0.80. These new probabilities are called "adjusted
probabilities" and denoted

(5)   $\hat{P}_{hij} = \alpha_h + \beta_h P_{hij}$ , where

      $\alpha_h$ = intercept of Realism Line for student h,
                   and
      $\beta_h$  = slope of Realism Line for student h.¹

The set of adjusted probabilities for our illustrative
data is provided in Table A-1.

Now, using (5), above, it can be shown that:

(6)   $\hat{P}_{hi+} = \sum_{j=1}^{n_i} \hat{P}_{hij}$
                     $= n_i \alpha_h + \beta_h$ .

If all items on a test have the same number of alternatives,
then the sum represented by (6), above, will equal 1.0
for all items and for all students who took the test.
However, for our illustrative data we have both two- and
three-alternative items; therefore, the sum of the adjusted
probabilities for any given item is not 1.0. In fact,
for this data,

      $\hat{P}_{hi+} = 0.93$ for items with two alternatives, and
      $\hat{P}_{hi+} = 1.07$ for items with three alternatives.

¹ We do, however, in practice impose two constraints
on (5); namely, if

      $\hat{P}_{hij} < C_i$ , we set $\hat{P}_{hij} = C_i$ , and if
      $\hat{P}_{hij} > 1.0$ , we set $\hat{P}_{hij} = 1.0$ .
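
The adjustment in (5), together with the two constraints, can be sketched as follows. This Python illustration takes the intercept and slope reported for student h and the illustrative truncation value of 0.01; the function and variable names are ours.

```python
ALPHA, BETA = 0.138, 0.656   # intercept and slope of the Realism Line for student h
C = 0.01                     # truncation value of the log scoring function

def adjusted(p, alpha=ALPHA, beta=BETA, c=C):
    """Adjusted probability from (5), held between c and 1.0."""
    return min(1.0, max(c, alpha + beta * p))

item_4 = [adjusted(p) for p in (0.45, 0.55)]          # a two-alternative item
item_6 = [adjusted(p) for p in (0.20, 0.40, 0.40)]    # a three-alternative item
```

Summing either list reproduces (6): the two-alternative item sums to 0.932 and the three-alternative item to 1.07, in agreement with the values of 0.93 and 1.07 given above.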



Now, recall that the use of the log scoring function
enables a student to maximize his or her score if the
student is realistic. Since the adjusted probabilities
are "more realistic" than the observed probabilities, it
follows that, if

      $\hat{P}_{hi*}$ replaces $P_{hi*}$

in (1), then, for a reasonably large number of items,
$\hat{L}_{h.}$ (weighted mean of adjusted log scores) should exceed
$L_{h.}$ (weighted mean of observed log scores). The $\hat{L}_{hi}$
scores are found in Table A-2, and their weighted mean
$\hat{L}_{h.}$ is about 82. Thus, by being more realistic, student
h could increase his or her average log score by
approximately 82 - 81 = 1 "point."¹

Taking the geometric mean of the $\hat{P}_{hi*}$, rather than using
them in (1), gives approximately 0.44. Thus, the difference
in geometric mean probability scores is approximately
0.44 - 0.41 = 0.03. Note that differences of 1 "point"
and 0.03 are relatively small; yet, the difference in
slopes between the Ideal Line and the Realism Line is
reasonably large (0.344). This discrepancy demonstrates
the need for caution in over-interpreting the meaning
of differences between slopes, especially for very small
amounts of data.

¹ As indicated previously, the sum of the adjusted
probabilities for an item is not necessarily equal to 1.0.
Therefore, adjusted log scores, based upon adjusted pro-
bability scores, are somewhat biased when not all items in
a test have the same number of alternatives. However, in
the author's experience, any bias that exists in
$\hat{L}_{hi}$ or $\hat{L}_{h.}$ is very slight, for most data.

Item Scores Computed by DEC-TEST

We have discussed above four different scores for
item i for student h which are computed by DEC-TEST:

      $P_{hi*}$       = observed probability associated
                        with correct answer,
      $L_{hi}$        = observed log score (associated
                        with correct answer),
      $\hat{P}_{hi*}$ = adjusted probability associated
                        with correct answer, and
      $\hat{L}_{hi}$  = adjusted log score (associated
                        with correct answer).

Note that, throughout this manual, we use "^" to indicate
that a variable makes use of adjusted probabilities.

We will now consider six other item scores generated
by DEC-TEST.

Perceived Entropy and Perceived Information. One of
the distinct advantages of decision-theoretic testing is
that the availability of probabilities associated with
each alternative for an item allows us to interpret
student responses in terms of information theoretic
principles (see, for example, Shannon & Weaver, 1949). Thus,
perceived entropy for item i for student h can be defined
as:

(7)   $EN_{hi} = -\sum_{j=1}^{n_i} P_{hij} \log_2 P_{hij}$ .

Note that a good translation of "entropy" for our purposes
is "uncertainty." Now, since the sum of observed proba-
bilities for an item should equal unity, the maximum pos-
sible amount of perceived information is:

(8)   $MI_{hi} = \log_2 n_i$ ,

which implies that perceived information for item i is
given by:

(9)   $I_{hi} = MI_{hi} - EN_{hi}$ .

Using Table A-1, the user can verify the values for
perceived entropy and perceived information given in
Table A-2. Note that $I_{hi} = 0.0$ when all observed proba-
bilities equal $1/n_i$. (See, for example, Item No. 5.)
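
Formulas (7) through (9) can be sketched as follows. The Python below is an illustration only; it assumes the usual convention that an alternative with probability zero contributes nothing to the entropy, which the manual does not state explicitly.

```python
import math

def perceived_entropy(probs):
    """Entropy in bits, as in (7); zero probabilities are skipped (an assumption)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def perceived_information(probs):
    """Maximum information log2(n_i), as in (8), minus the entropy, as in (9)."""
    return math.log2(len(probs)) - perceived_entropy(probs)

item_5_information = perceived_information([0.50, 0.50])
item_4_entropy = perceived_entropy([0.45, 0.55])
```

For Item No. 5, where both probabilities equal 1/2, the perceived information is exactly zero; for Item No. 4 the perceived entropy is a little below the one-bit maximum for a two-alternative item.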

Actual Entropy and Actual Information. Using adjusted
probabilities we define actual entropy in a manner
analogous to that used to define perceived entropy; namely,

(10)  $\widehat{EN}_{hi} = -\sum_{j=1}^{n_i} \hat{P}_{hij} \log_2 \hat{P}_{hij}$ .

Now, in order to define actual information, we need
to know the maximum possible amount of actual information,
which is, in general,

(11)  $\widehat{MI}_{hi} = -(n_i \alpha_h + \beta_h)[\log_2(n_i \alpha_h + \beta_h) - \log_2(n_i)]$ .¹

In fact, (11) reduces to (8) when all items have the same
number of alternatives, in which case the sum of the
adjusted probabilities for any item equals unity.

Using (10) and (11), actual information is defined as:

(12)  $\hat{I}_{hi} = \widehat{MI}_{hi} - \widehat{EN}_{hi}$ .

Using Table A-1, the user can verify the values for
actual entropy and actual information given in Table A-2.
Note that, for this data:

      $\widehat{MI}_{hi} = 1.027$ for items with two alternatives, and
      $\widehat{MI}_{hi} = 1.591$ for items with three alternatives.
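
The two values of maximum actual information can be reproduced from (11) with the sketch below (Python, for illustration; the function name is ours, and the intercept and slope are those reported for student h).

```python
import math

ALPHA, BETA = 0.138, 0.656   # intercept and slope of the Realism Line for student h

def max_actual_information(n_alternatives, alpha=ALPHA, beta=BETA):
    """Formula (11), where s = n * alpha + beta is the sum of adjusted probabilities."""
    s = n_alternatives * alpha + beta
    return -s * (math.log2(s) - math.log2(n_alternatives))

two_alternatives = max_actual_information(2)
three_alternatives = max_actual_information(3)

# When the adjusted probabilities sum to unity, (11) reduces to log2(n), as in (8).
reduces_to_eight = max_actual_information(2, alpha=0.25, beta=0.5)
```

The hypothetical intercept of 0.25 and slope of 0.5 in the last line are chosen only so that the adjusted probabilities of a two-alternative item sum to 1.0.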



¹ Even formula (11) is apt to be somewhat inaccurate
in that we never allow an adjusted probability to be less
than the truncation value for the log scoring function, and
we never allow an adjusted probability to be greater than
unity. However, any bias in (11) is usually very slight.

Elimination Scores. Coombs et al. (1956) suggest
a procedure for scoring a test based upon considering the
alternatives that a student eliminates, i.e., judges to be
incorrect. For this scoring system, a student receives
$1/(n_i - 1)$ points for each incorrect alternative
eliminated and loses 1.0 point for eliminating a correct
alternative; thus, an unweighted item score falls between
-1.0 and 1.0.

Now, using the observed probabilities for item i,
one can estimate a student's elimination score for
item i, which we designate as $E_{hi}$.

There is, however, one problem in this estimation
procedure, which can be illustrated by considering the
observed probabilities for Item No. 4,

      $P_{h41} = 0.45$ and $P_{h42} = 0.55$ ,

in the illustrative data in Table A-1. If student h had
taken this item under elimination scoring rules, would
student h eliminate only alternative-1, both alternatives,
or neither alternative? Actually, if student h eliminated
both alternatives or neither alternative, the elimination
score for the item would be the same, namely, 0.0.
However, whether or not student h will eliminate alter-
native-1 depends upon whether or not, for elimination
scoring purposes, student h would consider probability
differences of 0.55 - 0.45 = 0.10 meaningful and signi-
ficant.



Thus, in order to estimate an elimination score from
the observed probabilities for an item, the user of
DEC-TEST must first assign a value to

      RETO = tolerance for elimination scoring.

This single value for RETO is used by DEC-TEST to
estimate elimination scores for all items, for all students.
Letting PMAX be the magnitude of the largest observed
probability for item i, for student h, $E_{hi}$ is determined
by applying the following algorithm to each of the $n_i$
alternatives of item i for student h:

      (a) If ($P_{hij}$ + RETO - PMAX) >= 0.0, add 0.0;

      (b) If ($P_{hij}$ + RETO - PMAX) < 0.0 and j = *,
          subtract 1.0; or

      (c) If ($P_{hij}$ + RETO - PMAX) < 0.0 and j ≠ *,
          add $1/(n_i - 1)$.

Note that (b) and (c) indicate eliminated alternatives.
For example, if RETO = 0.10, then student h would not elim-
inate any of the alternatives for Item No. 4, resulting
in $E_{h4} = 0.0$. See Table A-2 for other examples.
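
Rules (a) through (c) can be sketched as a short function. This is a Python illustration rather than the DEC-TEST routine; the function name, the zero-based index for the correct alternative, and the small guard `eps` against floating-point round-off are all additions made here.

```python
def estimated_elimination_score(probs, correct, reto, eps=1e-9):
    """Estimate the elimination score for one item from its observed probabilities.

    probs   : observed probabilities for the alternatives of the item
    correct : zero-based position of the correct alternative
    reto    : tolerance for elimination scoring (RETO)
    eps     : round-off guard, not part of the manual's algorithm
    """
    pmax = max(probs)
    n = len(probs)
    score = 0.0
    for j, p in enumerate(probs):
        if p + reto - pmax >= -eps:
            continue                    # rule (a): alternative is not eliminated
        if j == correct:
            score -= 1.0                # rule (b): correct alternative eliminated
        else:
            score += 1.0 / (n - 1)      # rule (c): incorrect alternative eliminated
    return score

item_4 = estimated_elimination_score([0.45, 0.55], correct=0, reto=0.10)
item_2 = estimated_elimination_score([0.05, 0.95], correct=0, reto=0.10)
item_8 = estimated_elimination_score([0.80, 0.20, 0.01], correct=0, reto=0.10)
```

With RETO = 0.10, Item No. 4 yields 0.0 as stated above; Item No. 2, where the correct alternative received only 0.05, yields the lowest score of -1.0; and Item No. 8, where both incorrect alternatives fall well below PMAX, yields 1.0.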

Classical Scores. For the classical scoring system,
the student is forced to pick one and only one alternative.
Using the observed probabilities, we can estimate a
student's item score for the classical system. The
procedure is as follows: if item i has $l$ ($l <= n_i$)
tied highest probabilities, one of which is associated with
the correct answer, then the classical unweighted item
score for student h is

      $C_{hi} = 1/l$ ; otherwise, $C_{hi} = 0$ .

For example, since, for Item No. 5, both observed proba-
bilities are 0.50 (one of which is obviously associated with
the correct answer, since the item has only two alterna-
tives), $C_{h5} = 1/2 = 0.50$; i.e., if forced to pick only
one alternative, student h has a 50-50 chance of picking
the correct answer and getting 1.0 point. Note that
$C_{h6} = 0.0$, since neither of the two highest probabilities
is associated with the correct answer.
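
The estimation rule for classical scores can be sketched in a few lines (Python, for illustration; the function name and the zero-based index for the correct alternative are ours, and ties are detected by exact equality of the observed probabilities).

```python
def estimated_classical_score(probs, correct):
    """Return 1/l when the correct alternative is among the l tied highest, else 0."""
    pmax = max(probs)
    tied = [j for j, p in enumerate(probs) if p == pmax]
    return 1.0 / len(tied) if correct in tied else 0.0

item_5 = estimated_classical_score([0.50, 0.50], correct=0)
item_6 = estimated_classical_score([0.20, 0.40, 0.40], correct=0)
```

These two calls reproduce the examples in the text: 0.50 for Item No. 5 and 0.0 for Item No. 6.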

II. Summary of Control Cards
and Student Data

The control cards and student data constitute the
input to DEC-TEST. All control cards except the Item Type
Key, Answer Key, and Item Cards must be on logical unit
LUCC, which is specified in the main-line program of DEC-
TEST. Usually, LUCC = 5, since most installations use
logical unit 5 for reading punched 80-column cards (the
usual medium for control cards) in FORTRAN; however, this
value can be altered (see Section VI).

No knowledge of FORTRAN is necessary for setting up
the control cards except for those control cards that are
object-time formats.

For most of the control cards, the following information
is provided: (a) data field (card columns), (b) variable
identification, and (c) a brief description of the variable
and its possible values.

Unless otherwise specified in the description of the
variable, variables beginning with I, J, K, L, M, or N
are integers, and other variables are real variables. The
values of integer variables should be right-justified inte-
gers without any decimal point. The values of real variables
should be (a) right-justified integers or (b) decimal numbers
including the decimal point. Technically, decimal numbers
need not be right-justified, but it is usually desirable to
right-justify them anyway. When A-format is specified (see
RUN(5) and JT(i) in the First Input Card and the Item Cards,
respectively), any alphabetic or numeric (alphanumeric)
character may be used.

Note that, for most of the control cards, columns 1-10
are not used by DEC-TEST. For such control cards, these
columns may be employed by the user for card identification.
Often, descriptions of variables in the following cards end
with "(0,1)". In these cases, the possible values for the
variable are "0" meaning "no" or "absent" and "1" meaning
"yes" or "present."

The user is cautioned that for integer and real variables,
FORTRAN interprets blanks as "0" and "0.0", respectively.


(1) First Input Card (required)

Columns   Variable   Description

11-15     RUN(5)     Five character run identification
                     (A-format)

16-20     K          Number of items

21-25     N          Number of students (student records)
                     0 = until end of file

26-30     INC        No. of columns for student
                     identification (0 <= INC <= 24)
                     0 = no student identifications

31-35     INCS       No. of columns for sort
                     (0 <= INCS <= INC <= 24)
                     0 = no sort

36-40     IBGS       First column for sort
                     (if INC ≠ 0 and INCS ≠ 0,
                     1 <= IBGS <= INC <= 24 and
                     IBGS + INCS - 1 <= INC)

41-45     IXTRA      Additional student variable (0,1)

46-55     XMS        Missing data code

56-60     MSD        Technique for handling missing
                     data for an item
                     0 = alternatives for missing
                         item transformed to
                         observed probabilities
                         of 1/no. of alternatives
                     1 = skip item

61-65     IOTH       Number of object-time format
                     cards for Heading (1 <= IOTH <= 2)

66-70     IOTD       Number of object-time format
                     cards for Student Data Input
                     (1 <= IOTD <= 10)

71-75     INVAR      Number of student variables on
                     Second Input Card(s)
                     0 = ten default student
                         variables
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(2) Second Input Card (required) 



C olumne 
11-15 



Variable 
IPT 



16-25 
26-30 



56-60 

61-65 
66-70 

71-75 
76-80 



DCT 



NORM 



LUCD 



LUSD 



LUPT 



LUPT2 



LUPC 



Description 

Kind of Gtudent data used 
as input 

1 = probabilities 

2 = 100 X probabilities 

3 = star method 
if = log scores 

5 s log scores in which, 
for all items, 
= 50, = 100, 

s 0«01, and scores 

of 100 are input as 99 

Tolerance for decision- 
theoretic testing 

Normalization procedure 

0 = no normalization 

1 = normalize over all 

probabilities for 
item, except those 
equal to trimcation 
value 

2 = normalize over all 

probabilities 

LoGical unit for reading 
Item Cards or Item Keys 

Logical unit -.Jor reading 
St^'.dent Data Input 

Primary logical unit for 
printing 

Secondary logical unit 
for printing 

Logical unit for punching 



A-17 



ERIC 



(3) Third Input Card (required)

Columns   Variable   Description   Applies to items with

11-15     AB(2,1)    Low score     two alternatives
16-20     AB(2,2)    High score    two alternatives
21-27     AB(2,3)    Truncation    two alternatives

28-32     AB(3,1)    Low score     three alternatives
33-37     AB(3,2)    High score    three alternatives
38-44     AB(3,3)    Truncation    three alternatives

45-49     AB(4,1)    Low score     four alternatives
50-54     AB(4,2)    High score    four alternatives
55-61     AB(4,3)    Truncation    four alternatives

62-66     AB(5,1)    Low score     five alternatives
67-71     AB(5,2)    High score    five alternatives
72-78     AB(5,3)    Truncation    five alternatives

Note: If the scoring function to be used is such that
      log score = 0.0 when the probability associated
      with the correct answer is 1/no. of alternatives,
      then AB(n_i,3) may be left blank and will be
      automatically calculated by DEC-TEST.

(4) Item Analysis Definition Card: Decision-Theoretic
Scoring (required)

Columns   Variable   Description

11-15     IDCV       Criterion variable number

16-20     IDGP       Grouping parameter
                       1 = percent of subjects
                       2 = score range

21-30     RDL(1)     LIMIT(1)
31-40     RDL(2)     LIMIT(2)
41-50     RDL(3)     LIMIT(3)
51-60     RDL(4)     LIMIT(4)

                     LIMIT(1) and LIMIT(2) bound the low group,
                     LIMIT(2) and LIMIT(3) the middle group, and
                     LIMIT(3) and LIMIT(4) the high group.

(5) Item Analysis Definition Card: Elimination Scoring
(required)

Columns   Variable   Description

11-15     IELV       Criterion variable number

16-20     IEGP       Grouping parameter
                       1 = percent of subjects
                       2 = score range

21-30     REL(1)     LIMIT(1)
31-40     REL(2)     LIMIT(2)
41-50     REL(3)     LIMIT(3)
51-60     REL(4)     LIMIT(4)

                     The limits bound the low, middle, and
                     high groups as in (4).

61-70     RETO       Tolerance for elimination
                     scoring

(6) Item Analysis Definition Card: Classical Scoring
(required)

Columns   Variable   Description

11-15     ICLV       Criterion variable number

16-20     ICGP       Grouping parameter
                       1 = percent of subjects
                       2 = score range

21-30     RCL(1)     LIMIT(1)
31-40     RCL(2)     LIMIT(2)
41-50     RCL(3)     LIMIT(3)
51-60     RCL(4)     LIMIT(4)

                     The limits bound the low, middle, and
                     high groups as in (4).

(7) First Output Card (required)

Columns   Variable   Description

11-12     IO(1)      Print input
                       0 = no
                       1 = long version
                       2 = short version

13-14     IO(2)      Print and/or punch observed
                     probabilities
                       0 = no
                       1 = print long version
                       2 = print short version
                       3 = print long version
                           and punch all observed
                           probabilities
                       4 = print short version and
                           punch all observed
                           probabilities
                       5 = punch all observed
                           probabilities

15-16     IO(3)      Print scores for each
                     individual subject (0,1)

17-18     IO(4)      Print rosters of student
                     scores
                       0 = no
                       1 = print minimum(10,
                           INVAR) scores
                       2 = punch all INVAR scores
                       3 = both 1 and 2

19-20     IO(9)      Item analysis indices for
                     decision-theoretic scoring
                       0 = no
                       1 = using observed
                           probabilities and
                           observed log scores
                       2 = using adjusted
                           probabilities and
                           adjusted log scores
                       3 = both 1 and 2

21-22     IO(10)     Item analysis indices for
                     elimination scoring (0,1)

23-24     IO(11)     Item analysis indices for
                     classical scoring (0,1)

25-26     IO(12)     Roster of item scores using
                     observed probabilities
                       0 = no
                       1 = print
                       2 = punch
                       3 = both print and punch

27-28     IO(13)     ... using adjusted
                     probabilities (0, 1, 2, or 3)
29-30     IO(14)     ... using observed log
                     scores (0, 1, 2, or 3)
31-32     IO(15)     ... using adjusted log
                     scores (0, 1, 2, or 3)
33-34     IO(16)     ... using elimination
                     scores (0, 1, 2, or 3)
35-36     IO(17)     ... using classical
                     scores (0, 1, 2, or 3)
37-38     IO(18)     ... using perceived
                     information (0, 1, 2, or 3)
39-40     IO(19)     ... using actual
                     information (0, 1, 2, or 3)

41-42     IO(20)     Reliability analysis using
                     observed probabilities (0,1)
43-44     IO(21)     ... using adjusted
                     probabilities (0,1)
45-46     IO(22)     ... using observed log
                     scores (0,1)
47-48     IO(23)     ... using adjusted log
                     scores (0,1)
49-50     IO(24)     ... using elimination
                     scores (0,1)
51-52     IO(25)     ... using classical
                     scores (0,1)
53-54     IO(26)     ... using perceived
                     information (0,1)
55-56     IO(27)     ... using actual
                     information (0,1)

57-58     IO(28)     Summary of reliability
                     analyses (including
                     Livingston's coefficient)
                     (0,1)

(8) Second Output Card (required)

Columns   Variable

11-14     JO(1)
15-18     JO(2)
19-22     JO(3)
23-26     JO(4)
27-30     JO(5)
31-34     JO(6)
35-38     JO(7)
39-42     JO(8)
43-46     JO(9)
47-50     JO(10)
51-54     JO(11)
55-58     JO(12)
59-62     JO(13)
63-66     JO(14)
67-70     JO(15)

Description: These are subject variable numbers used for the
rosters of subject scores; DEC-TEST expects INVAR of these
variable numbers to be specified. If INVAR > 15, then use as
many additional cards as may be required. Subsequent cards
follow the same format as indicated here. The first variable
on the first subsequent card would be JO(16), the second
JO(17), etc.


(9) Third Output Card (required, but blank card may be
used if IO(28) = 0)

Columns   Variable   Description

1-5       CTT(1,1)   First criterion score for
                     Livingston's Reliability
                     Coefficient when reliability
                     analysis uses observed
                     probabilities
6-10      CTT(1,2)   Second ... observed probs.
11-15     CTT(2,1)   First ... adjusted probs.
16-20     CTT(2,2)   Second ... adjusted probs.
21-25     CTT(3,1)   First ... observed log scores
26-30     CTT(3,2)   Second ... observed log scores
31-35     CTT(4,1)   First ... adjusted log scores
36-40     CTT(4,2)   Second ... adjusted log scores
41-45     CTT(5,1)   First ... elimination scores
46-50     CTT(5,2)   Second ... elimination scores
51-55     CTT(6,1)   First ... classical scores
56-60     CTT(6,2)   Second ... classical scores
61-65     CTT(7,1)   First ... perceived information
66-70     CTT(7,2)   Second ... perceived information
71-75     CTT(8,1)   First ... actual information
76-80     CTT(8,2)   Second ... actual information



(10) Object-Time Format Card(s) for Heading (required)

(11) Object-Time Format Card(s) for Subject Data Input
(required; use A and F format)

(12) Object-Time Format Card for Answer Key and Item
Type Key (required, but blank card may be used if
ITCDS = 1; use I format)

(13) Item Keys Definition Card (required)

Columns   Variable   Description

11-15     ITCDS      Item cards parameter
                       1 = one item card for
                           each item
                       2 = answer key and item
                           type key only

16-20     ITKEY      Order of answer key and
                     item type key
                       1 = answer key first
                       2 = item type key first

21-25     ITSCO      Scores for each item (0,1)

26-30     ITDEC      Decision-theoretic item
                     analysis for each item
                       0 = no
                       1 = using observed probs.
                       2 = using adjusted probs.
                       3 = both 1 and 2

31-35     ITELI      Elimination item analysis
                     for each item (0,1)

36-40     ITCLA      Classical item analysis
                     for each item (0,1)

Note: If ITCDS = 1, then the remaining parameters
      are ignored by DEC-TEST and may be left blank.

(14, 15) Item Keys (required only if ITCDS = 2)

No keys should be present if ITCDS = 1.
Both keys must conform to the format specified
by the user in (12).
Both keys must be on logical unit LUCD.
Answer Key comes first if ITKEY = 1.
Item Type Key comes first if ITKEY = 2.

Answer Key is IT(i,2), i = 1, 2, ..., K.

Item Type Key is IT(i,3), i = 1, 2, ..., K.

(16) Item Cards (required only if ITCDS = 1)

No item cards should be present if ITCDS = 2.

Number of cards must equal K.

Cards must be on logical unit LUCD.

Each card must conform to the following format:

Columns   Variable   Description

7-10      JT(i)      User-defined item identification
                     (A-format)

11-15     IT(i,2)    Correct answer
                     (1 <= IT(i,2) <= IT(i,3))

16-20     IT(i,3)    Number of alternatives
                     (2 <= IT(i,3) <= 5)

21-25     RIT(i)     Item weight; if all items
                     have equal weight, let
                     RIT(i) = 1

26-30     IT(i,4)    Split-halves key for
                     reliability analyses
                       0 = skip item
                       1 = first half
                       2 = second half

31-35     IT(i,5)    Unweighted student scores
                     on item (0,1)

36-40     IT(i,6)    Decision-theoretic item
                     analysis
                       0 = no
                       1 = using observed probs.
                       2 = using adjusted probs.
                       3 = both 1 and 2

41-45     IT(i,7)    Elimination scoring item
                     analysis (0,1)

46-50     IT(i,8)    Classical scoring item
                     analysis (0,1)



(17) Student Data (required)

All data for one student constitute a student record,
which must conform to the format specified by
the user in (11).

DEC-TEST expects N student records unless N = 0,
in which case all student data is read until
an end of file code is encountered.

All student data must be on logical unit LUSD.

(18) End of File Card (required)

The last card (or, preferably, the next to the last
card) in the deck input to DEC-TEST should contain
/* in columns 1 and 2, respectively, and blanks in
the remaining 78 columns.

(19) End of Job Card (optional, but desirable)

Preferably, the last card in the deck input to
DEC-TEST should contain // in columns 1 and 2,
respectively, and blanks in the remaining 78
columns.



III. Detailed Description of Control Cards



The following pages provide a more detailed description
of the parameters and variables defined by the user
by means of the control cards.

(1) First Input Card (required) 

RUN(5): a five-character alphanumeric run identification
printed in the top left-hand corner of printed
output and punched in the first five columns of punched
output.

K: the number of items to be analyzed.

N: the number of student records input to DEC-TEST.
N is not the number of cards containing student data.
The Object-Time Format Card(s) for Subject Data Input
defines one student record. If N = 0, then all student
data is read until an end of file code is encountered,
and DEC-TEST counts the number of student records.

INC: the number of columns for student identification.
0 <= INC <= 24. If INC = 0, then the student data
contains no student identification information, and
DEC-TEST identifies students only according to a sequential
student number; i.e., the first record of student data
encountered is for student number 1, the second record for
student number 2, ..., the last record encountered is for
student number N.

INCS: the number of columns used for sorting student
identifications. If student identifications are alphabetic
(e.g., names), the sort results in an alphabetization
for the number of columns specified. 0 <= INCS <= INC <= 24.
If INCS = 0, then no sort is performed. If INC = 0, then
INCS is ignored and may be left blank by the user.

IBGS: the first column for sorting the student
identifications. Unless INC = 0 or INCS = 0, the columns sorted
are columns IBGS to (IBGS + INCS - 1) in the student
identifications. If INC = 0 or INCS = 0, then IBGS is ignored
and may be left blank by the user. If INC ≠ 0 and INCS ≠ 0,
then 1 <= IBGS <= INC <= 24 and IBGS + INCS - 1 <= INC.
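
The constraints on INC, INCS, and IBGS can be checked mechanically. The following Python is a minimal sketch, not DEC-TEST code; the function name sort_key and the sample identifications are hypothetical. It returns the part of an identification that would be sorted on, or None when no sort applies.

```python
def sort_key(student_id, inc, incs, ibgs):
    """Return columns IBGS..(IBGS + INCS - 1) of an INC-column identification."""
    if inc == 0 or incs == 0:
        return None  # no identification present, or no sort requested
    if not 0 <= incs <= inc <= 24:
        raise ValueError("need 0 <= INCS <= INC <= 24")
    if not (1 <= ibgs <= inc and ibgs + incs - 1 <= inc):
        raise ValueError("need 1 <= IBGS <= INC and IBGS + INCS - 1 <= INC")
    return student_id[ibgs - 1:ibgs + incs - 1]


# Hypothetical 10-column identifications: a 4-digit number, then a name.
ids = ["0003JONES ", "0001SMITH ", "0002ADAMS "]
by_name = sorted(ids, key=lambda s: sort_key(s, inc=10, incs=5, ibgs=5))
```

With INC = 10, INCS = 5, and IBGS = 5, the sort is on columns 5-9 (the name), so the records come out in alphabetical order of name rather than in order of the leading number.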

IXTRA: If IXTRA = 1, then the Object-Time Format
Card(s) for Subject Data Input specifies an additional
student variable input to DEC-TEST. This variable is
available as a criterion variable for item analysis; it is
treated just like any other student variable. If IXTRA = 0,
then no additional student variable is included in the
student input data.

XMS and MSD: XMS is the code used to identify missing
data in the set of data input to DEC-TEST. XMS may be any
real number consistent with the format specified by the
Object-Time Format Card(s) for Subject Data Input; however,
the user should be careful that XMS is not a valid score
value for input. For example, if XMS is left blank, XMS is
set to 0.0 by FORTRAN, but 0.0 may be a valid score value.
Note that we are using the word "score" here to mean the
score for alternative j on item i for student h. MSD is the
technique for handling (processing) missing data for student
h for item i. If MSD = 0 and the score values input for each
of the $n_i$ alternatives for item i equal XMS, then, for the
student under consideration, each alternative is assigned
an observed probability of

$$P_{hij} = 1/n_i, \quad j = 1, 2, \ldots, n_i .$$

If MSD = 1 and the values input for each of the $n_i$
alternatives for item i equal XMS, then, for the
student under consideration, item i is skipped.
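
The two missing-data techniques can be sketched as follows. This Python is illustrative only and is not taken from DEC-TEST; the name handle_missing is hypothetical. The manual does not say what happens when only some of an item's scores equal XMS, so the sketch assumes such scores are passed through unchanged.

```python
def handle_missing(scores, xms, msd):
    """Return (values, skipped) for one student's scores on one item.

    scores holds one input score per alternative. The item counts as
    missing only when every score equals the missing data code XMS.
    """
    if all(s == xms for s in scores):
        if msd == 0:
            n = len(scores)
            # Each alternative gets an observed probability of 1/n_i.
            return [1.0 / n] * n, False
        if msd == 1:
            return None, True  # item is skipped for this student
        raise ValueError("MSD must be 0 or 1")
    # Assumption: partially missing or complete data is left as input.
    return list(scores), False


uniform, skipped_0 = handle_missing([-1.0, -1.0, -1.0, -1.0], xms=-1.0, msd=0)
nothing, skipped_1 = handle_missing([-1.0, -1.0, -1.0, -1.0], xms=-1.0, msd=1)
```

The example uses -1.0 as the missing data code because, as noted above, a code of 0.0 could collide with a valid score.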

IOTH: the number of Object-Time Format Cards for
Heading. 1 <= IOTH <= 2.

IOTD: the number of Object-Time Format Cards for
Subject Data Input. 1 <= IOTD <= 10.

INVAR: the number of student variables on the Second
Output Card(s). 0 <= INVAR <= 102. If INVAR = 0, then any
values on the Second Output Card are ignored; furthermore,
student variables numbered 6, 8, 62, 77, 72, 95, 90, 100,
99, and 101 are reported in the rosters of student scores
if IO(4) = 1 or 3 and/or punched out if IO(4) = 2 or 3.



(2) Second Input Card (required)

IPT: the kind of student data used as input to DEC-TEST.
The user can employ any one of five different kinds of input
to DEC-TEST. The first three kinds of input are probabilities
or linear transformations of probabilities, and the last two
kinds are log scores. Note that DEC-TEST expects an input
score (perhaps XMS) for each alternative, for each item, for
each student. The kinds of input are as follows:

(a) If IPT = 1, then probabilities are used as input.

(b) If IPT = 2, then probabilities multiplied by 100
are used as input.

(c) If IPT = 3, then the star method is used as input.
In one of the original articles on decision-theoretic
testing, de Finetti (1965) suggested that students could
be told to respond to an item by distributing five
stars or points over the set of alternatives for an item.
If, for example, an item has five alternatives, and a
student assigns one star or point to each alternative,
then the probability associated with each alternative
would be 0.20. DEC-TEST allows any total number of stars
or points to be used for any item. That is, the total
number of stars or points may differ for each item and/or
for each student. DEC-TEST adds up the numbers (stars)
assigned to each alternative for each item by each student
and uses this total as the divisor for calculating
probabilities for the item for the particular student.

Footnote 1 (on the star method): From a practical point of
view, it is usually wise to tell the students that they should
use a single, instructor-specified total number of points for
all items on the test. Such a procedure simplifies the task
for the students. Then, if any student uses a different total
number of points for any item, by intent or by mistake,
DEC-TEST will routinely make appropriate adjustments, as
indicated above. The author has found that students readily
understand this response strategy, whereas they sometimes
have difficulty when asked to respond with log scores
(IPT = 4 or 5) directly. Also, the author has found that,
when the items on a test have two, three, or four
alternatives, twelve points is a convenient number to use
for distribution over the alternatives of an item. Since
twelve is divisible by two, three, and four, students can
always indicate "no knowledge" (i.e., a probability equal
to $1/n_i$). Furthermore, twelve stars allow for a reasonably
dense range of probabilities, and hence log scores.

(d) If IPT = 4, then log scores are used as input.
In general, the formula for a log score for student h
on alternative j of item i is

$$L_{hij} = A_i \log R_{hij} + B_i$$

with a truncation value of $C_i$, where

$R_{hij}$ = original probability for student h,
for item i, for alternative j.

Note that the original probabilities (designated by an upper
case "R") are to be distinguished from the observed
probabilities (designated by an upper case "P"); for further
information about this distinction, see Section I and the
discussion of DCT. Also, note the distinction between the
log score for alternative j of item i, as given above, and
the log score for item i, as given in Section I; i.e., the
two log scores are equal only when j = * (the correct answer).
The scoring function given above is actually a family of
scoring functions, each member of which is determined once
$A_i$, $B_i$, and $C_i$ are specified (see Third Input Card).
In DEC-TEST, these parameters must be the same for all
items having the same number of alternatives.

(e) If IPT = 5, then the scores used as input are
log scores in which $A_i = 50$, $B_i = 100$, and the truncation
value $C_i = 0.01$ for all items, regardless of the number of
alternatives. Furthermore, a score of 100 is input as 99.
Using this procedure, each score for each alternative
occupies no more than two positions (e.g., two card columns).
Thus, this procedure is very useful when students use a
SCoRule (a device for converting probabilities to log
scores with $A_i = 50$, $B_i = 100$, and $C_i = 0.01$) to record
log scores which are then punched on cards for input to
DEC-TEST.



DCT: tolerance for decision-theoretic testing. The
first major computation performed by DEC-TEST involves
converting (if necessary) the input student data to original
probabilities $R_{hij}$. Note that, for this conversion
procedure, any original probability less than the
truncation value $C_i$ is automatically converted to $C_i$.

If we let $Z_{hij}$ be a generic input score for student h,
for item i, for alternative j, then:

(a) If IPT = 1, $R_{hij} = Z_{hij}$;

(b) If IPT = 2, $R_{hij} = Z_{hij}/100$;

(c) If IPT = 3, $R_{hij} = Z_{hij} / \sum_{j} Z_{hij}$;

(d) If IPT = 4, $R_{hij} = 10^{(Z_{hij} - B_i)/A_i}$;

(e) If IPT = 5, then whenever $Z_{hij} = 99$ and all other
scores for item i, for student h are 0, change
the "99" to "100" and use the formula for $R_{hij}$
in (d), above, letting $A_i = 50$ and $B_i = 100$.

Now, theoretically,

$$\sum_{j=1}^{n_i} R_{hij} = 1.0$$

for each item, for each student. In practice, however,
this is not always true; for example, students sometimes
err in recording their responses, which results in the
sum of the original probabilities for an item not equaling
1.0. If such errors are slight, then there is little cause
for concern; however, large errors can be troublesome and
usually indicate that a student is not following directions
for recording responses. If, for student h on item i,

$$\mathrm{ABS}\left(1.0 - \sum_{j=1}^{n_i} R_{hij}\right) > \mathrm{DCT},$$

where ABS means "absolute value," then DEC-TEST calls this
a validity check for student h and allows the user to
perform a normalization procedure for the original
probabilities. For example, suppose that DCT = 0.04 and
the original probabilities are 0.00, 0.63, and 0.41. The
probability 0.00 is automatically converted to 0.01; then,
since

ABS[1.0 - (0.01 + 0.63 + 0.41)] = 0.05 > 0.04,

a validity check occurs.
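
The conversion rules (a) through (e) and the validity check can be sketched in a few lines. The Python below is an illustration written for this manual's formulas, not DEC-TEST source; the function names are hypothetical, the logarithm is taken to be base 10, and the defaults for a, b, and c are simply the IPT = 5 values.

```python
def original_probabilities(z, ipt, a=50.0, b=100.0, c=0.01):
    """Convert one student's input scores z for one item to original
    probabilities, truncating anything below c up to c."""
    z = list(z)
    if ipt == 1:                      # probabilities
        r = z
    elif ipt == 2:                    # 100 x probabilities
        r = [x / 100.0 for x in z]
    elif ipt == 3:                    # star method: divide by the total
        total = sum(z)
        r = [x / total for x in z]
    elif ipt in (4, 5):               # log scores
        if ipt == 5:
            a, b, c = 50.0, 100.0, 0.01
            # A lone 99 with all other scores 0 stands for a score of 100.
            if z.count(99) == 1 and all(x == 0 for x in z if x != 99):
                z = [100 if x == 99 else x for x in z]
        r = [10.0 ** ((x - b) / a) for x in z]
    else:
        raise ValueError("IPT must be 1, 2, 3, 4, or 5")
    return [max(x, c) for x in r]


def validity_check(r, dct):
    """True when the probabilities miss a sum of 1.0 by more than DCT."""
    return abs(1.0 - sum(r)) > dct


r = original_probabilities([0.00, 0.63, 0.41], ipt=1)
flagged = validity_check(r, dct=0.04)
stars = original_probabilities([6, 3, 3, 0], ipt=3)
```

Here r reproduces the example above (0.00 is raised to 0.01 and the item is flagged), while the twelve-star response converts to 0.5, 0.25, 0.25, and a truncated 0.01 and stays within a tolerance of 0.04.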

NORM: the normalization procedure for items when a
validity check occurs. Note that normalization is never
performed for an item unless a validity check for that
item occurs.

(a) If NORM = 0, then no normalization is performed
and the original probabilities are simply called the
observed probabilities.

(b) If NORM = 1, then normalization is performed over
those alternatives for the item that are greater than the
truncation value. For example, suppose $C_i = 0.01$,
DCT = 0.04, NORM = 1, and the original probabilities
for a three-alternative item are 0.01, 0.63, and 0.41.
A validity check occurs (see discussion of DCT) and the
original probabilities are transformed to 0.01,
0.63/1.04 = 0.6058, and 0.41/1.04 = 0.3942. These new
probabilities are called observed probabilities. Note that
the sum of the observed probabilities not equal to 0.01
(the truncation value) is exactly 1.0.

(c) If NORM = 2, then normalization is performed over
all alternatives, with the constraint that none of the
resulting observed probabilities is allowed to be less than
the truncation value. For example, if $C_i = 0.01$,
DCT = 0.04, NORM = 2, and the original probabilities
for a three-alternative item are 0.01, 0.63, and 0.41,
then the observed probabilities become 0.01,
(0.63)(0.99)/1.04 = 0.5997, and (0.41)(0.99)/1.04 = 0.3903.
If, on the other hand, the original probabilities were
0.01, 0.55, and 0.42, then the observed probabilities would
be 0.01/0.98 = 0.0102, 0.55/0.98 = 0.5612, and
0.42/0.98 = 0.4286. Note that these new probabilities are
called observed probabilities and they sum to exactly 1.0.
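
The three normalization options can be sketched as below. This Python is an interpretation of the worked examples above rather than the DEC-TEST routine itself; the name normalize is hypothetical. In particular, the NORM = 2 branch assumes that any alternative pushed below the truncation value is held at that value while the remaining alternatives are rescaled, which reproduces both examples. The caller is assumed to apply it only after a validity check has occurred.

```python
def normalize(r, norm, c):
    """Turn original probabilities r into observed probabilities."""
    r = list(r)
    if norm == 0:
        return r
    if norm == 1:
        # Rescale only the alternatives above the truncation value.
        total = sum(x for x in r if x > c)
        return [x / total if x > c else x for x in r]
    if norm == 2:
        fixed = set()  # alternatives held at the truncation value
        while True:
            target = 1.0 - c * len(fixed)
            total = sum(x for i, x in enumerate(r) if i not in fixed)
            p = [c if i in fixed else x * target / total
                 for i, x in enumerate(r)]
            low = {i for i, x in enumerate(p) if i not in fixed and x < c}
            if not low:
                return p
            fixed |= low
    raise ValueError("NORM must be 0, 1, or 2")


norm1 = [round(x, 4) for x in normalize([0.01, 0.63, 0.41], 1, 0.01)]
norm2 = [round(x, 4) for x in normalize([0.01, 0.63, 0.41], 2, 0.01)]
norm2b = [round(x, 4) for x in normalize([0.01, 0.55, 0.42], 2, 0.01)]
```

Rounded to four places, the three results agree with the figures quoted in (b) and (c).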

LUCD: logical unit for reading item cards or item keys.

LUSD: logical unit for reading Student Data Input,
including student identifications and the additional
student variable, if present.

LUPT: primary logical unit for printing all output
except that printed on LUPT2.

LUPT2: secondary logical unit for printing, which is
used to print Individual Subject Scores and the Summary of
Reliability Analyses.

LUPC: logical unit for punched output. Output
designated as punched output can, of course, be written
on any medium (e.g., cards, tape, disc, or drum); however,
all "punched" records will be written as 80-column card
images.

(3) Third Input Card (required)

The third input card reads in values for a matrix
$\mathrm{AB}(n_i, a)$, where $n_i$ = 2, 3, 4, or 5 (number of
alternatives for an item) and a = 1, 2, or 3 (parameters for
the log scoring function). Recall that, in general,

$$L_{hij} = A_i \log R_{hij} + B_i ,$$

where, as the examples below show, log is the base-10
logarithm. In terms of the matrix notation introduced above,

$$B_i = \mathrm{AB}(n_i, 2),$$

$$C_i = \mathrm{AB}(n_i, 3), \text{ and}$$

$$A_i = -[B_i - \mathrm{AB}(n_i, 1)] / \log C_i .$$

From the last equation, it should be clear that

$$\mathrm{AB}(n_i, 1) = A_i \log C_i + B_i ;$$

i.e., the lowest score for the log scoring function for
item type i (here, items having the same number of
alternatives have the same item type) is obtained when $R_{hij}$
is set to the truncation value $C_i$. Thus, once the
user specifies the low score, high score, and truncation
value, DEC-TEST has or can calculate the log scoring
function parameters $A_i$, $B_i$, and $C_i$. For example, if

$\mathrm{AB}(n_i, 1) = 0.0$,

$\mathrm{AB}(n_i, 2) = 100.0$, and

$\mathrm{AB}(n_i, 3) = 0.01$, then

$B_i = 100.0$, $C_i = 0.01$, and

$A_i = -(100.0 - 0.0) / \log 0.01 = 50.0$.

If the scoring system for item type i is such that
$L_{hij} = 0.0$ when $R_{hij} = P_{hij} = 1/n_i$, then
$\mathrm{AB}(n_i, 3)$ may be left blank and will be calculated
by DEC-TEST using the formula:

$$\mathrm{AB}(n_i, 3) = 10^{\left[\frac{\mathrm{AB}(n_i, 2) - \mathrm{AB}(n_i, 1)}{\mathrm{AB}(n_i, 2)}\right] \log (1/n_i)} .$$

Clearly, in this case, the lowest score should be negative.
For example, if

$\mathrm{AB}(4, 2) = 10.0$ and $\mathrm{AB}(4, 1) = -10$, then

$\mathrm{AB}(4, 3) = 10^{(20/10) \log (1/4)} = 0.0625$,

$A_4 = -20 / \log 0.0625 = 16.6096$,

$B_4 = 10$, and $C_4 = 0.0625$.
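
The relations among the low score, high score, truncation value, and the parameters of the log scoring function can be verified numerically. The Python below is a minimal sketch under the assumption that log is the base-10 logarithm; the function names are hypothetical and do not come from DEC-TEST.

```python
import math


def log_scoring_parameters(low, high, trunc=None, n_alternatives=None):
    """Return (A, B, C) from a low score, high score, and truncation value.

    If trunc is None, it is derived from the condition that the log score
    is 0.0 when the probability equals 1/n_alternatives.
    """
    b = high
    if trunc is None:
        exponent = ((high - low) / high) * math.log10(1.0 / n_alternatives)
        trunc = 10.0 ** exponent
    a = -(b - low) / math.log10(trunc)
    return a, b, trunc


def log_score(r, a, b, c):
    """Log score for probability r, truncated below at c."""
    return a * math.log10(max(r, c)) + b


a1, b1, c1 = log_scoring_parameters(0.0, 100.0, 0.01)
a4, b4, c4 = log_scoring_parameters(-10.0, 10.0, n_alternatives=4)
no_knowledge = log_score(0.25, a4, b4, c4)
lowest = log_score(0.0, a4, b4, c4)
```

The first call gives A = 50 for the 0-to-100 system; the second reproduces the four-alternative example, where a probability of 1/4 scores 0.0 and a truncated probability scores the low value of -10.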

If the test being analyzed does not make use of items
having $n_i$ alternatives, then $\mathrm{AB}(n_i, 1)$,
$\mathrm{AB}(n_i, 2)$, and $\mathrm{AB}(n_i, 3)$ may be left blank.

(4-6) Item Analysis Definition Cards (required)



There are three kinds of Item Analysis Definition
Cards. These cards define item analysis procedures when
items are scored according to decision-theoretic scoring
rules, elimination scoring rules, and classical scoring
rules, respectively. The parameters defined by each card
are identical in meaning, except for the inclusion of
RETO (tolerance for elimination scoring) in the card for
item analysis under elimination scoring rules.

Criterion variable . Any one of the 102 variables 
described in Section V may be chosen, by number, as the 
criterion variable for item analysis. The criterion 
variable functions in a manner similar to total test score 
in typical item analysis procedures. Variables numbered 
77, 100, and 99 are common choices for decision-theoretic, 
elimination, and classical item analyses, respectively. 
Note that the "additional student variable" is available 
as a criterion variable for item analyses. 

Grouping and limits. In order to perform item analyses,
subjects must be assigned to groups in one of two
ways. If the grouping parameter equals 1, the first step
in the grouping procedure involves rank ordering students
on the criterion variable. Then, the lowest

100.0[LIMIT(2) - LIMIT(1)]%

of the students constitute the "lower" group, the next lowest

100.0[LIMIT(3) - LIMIT(2)]%

of the students constitute the "middle" group, and
the highest

100.0[LIMIT(4) - LIMIT(3)]%

constitute the "upper" group. For example, if LIMIT(1) =
0.00, LIMIT(2) = 0.33, LIMIT(3) = 0.67, and LIMIT(4) = 1.00,
then the lower group is the lowest 33% of the distribution,
the middle group is the next lowest (middle) 34% of the
distribution, and the upper group is the highest 33% of the
distribution of students.

If the grouping parameter equals 2, then letting a
generic score for student h be S_h, the lower group consists
of all students for whom

LIMIT(1) <= S_h < LIMIT(2),

the middle group consists of all students for whom

LIMIT(2) <= S_h < LIMIT(3),

and the upper group consists of all students for whom

LIMIT(3) <= S_h <= LIMIT(4).

Note that when the grouping parameter equals 2, the limits
chosen must correspond to potential values of the criterion
variable if the grouping procedure is to have meaningful
results.
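
The two grouping rules can be sketched as follows. This Python is illustrative and is not DEC-TEST code; the name assign_groups is hypothetical. The manual does not say how percentages are turned into whole numbers of students or how ties are broken, so the sketch assumes cut points of round(LIMIT x number of students) and leaves ties in their sorted order.

```python
def assign_groups(scores, parameter, limits):
    """Return 'lower', 'middle', 'upper', or None for each student."""
    l1, l2, l3, l4 = limits
    n = len(scores)
    groups = [None] * n
    if parameter == 1:
        # Percent of subjects: rank order, then cut at the limits.
        order = sorted(range(n), key=lambda h: scores[h])
        cuts = [round(limit * n) for limit in limits]  # assumed rounding
        for name, lo, hi in (("lower", cuts[0], cuts[1]),
                             ("middle", cuts[1], cuts[2]),
                             ("upper", cuts[2], cuts[3])):
            for h in order[lo:hi]:
                groups[h] = name
    elif parameter == 2:
        # Score range: compare each criterion score with the limits.
        for h, s in enumerate(scores):
            if l1 <= s < l2:
                groups[h] = "lower"
            elif l2 <= s < l3:
                groups[h] = "middle"
            elif l3 <= s <= l4:
                groups[h] = "upper"
    else:
        raise ValueError("grouping parameter must be 1 or 2")
    return groups


criterion = [50, 10, 90, 30, 70, 20, 80, 40, 60]
by_percent = assign_groups(criterion, 1, (0.00, 0.33, 0.67, 1.00))
by_range = assign_groups(criterion, 2, (0, 40, 70, 100))
```

For these nine hypothetical criterion scores both calls yield three students per group. Setting two adjacent limits equal, for example (0, 40, 40, 100) with parameter 2, leaves the middle group empty.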

Whether the grouping parameter equals 1 or 2, any of
the three groups can often be eliminated by setting that
group's limits equal. Note, however, that if either the lower
group or the upper group contains no students, then several
of the item statistics usually reported cannot be calculated;
in such cases a sequence of *'s is printed for the
uncalculatable statistics.

RETO. RETO is the tolerance for elimination scoring.
DEC-TEST must always have a value for RETO. If the user
leaves RETO blank, then RETO is set to 0.0 automatically
by FORTRAN. See Section I for a consideration of the
function of RETO in elimination scoring. (Section I also
describes the procedure for classical scoring given the
observed probabilities.)

(7) First Output Card (required) 

The First Output Card, as described in the Summary of
Control Cards and Student Data (Section II), is for the
most part self-explanatory. Note, however, that "punch"
should be interpreted as "write" on some medium (e.g.,
cards, tape, disc, or drum) defined by the user. For
further information, see the opening paragraphs of Section
IV and the descriptions of each of the various outputs.

(8) Second Output Card (required)

The Second Output Card lists, by number, the student
variables to be printed and/or punched in the rosters of
student scores (Output Nos. 6 and 36 in Section IV,
respectively). The number of student variables expected by
DEC-TEST is specified by INVAR on the First Input Card.
If INVAR = 0, then the Second Output Card may be left
blank.

Note that 0 <= INVAR <= 102, but only fifteen subject
variables can fit on one card; thus, more than one card may
be required. Each required additional card follows the
same format as the original Second Output Card. For
example, if INVAR = 20, then two cards are required: one
containing the first fifteen subject variable numbers and
a subsequent card containing the last five subject variable
numbers in the first five fields. Also, note that the order
in which the subject variable numbers are specified is
immaterial.
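
The splitting of INVAR variable numbers across cards can be sketched as follows. This Python is not part of DEC-TEST; the name second_output_cards is hypothetical, and the layout it assumes (fifteen right-justified four-column fields starting in column 11, with columns 1-10 left blank) is a reading of the Second Output Card summary in Section II rather than a confirmed specification.

```python
def second_output_cards(variable_numbers, per_card=15, width=4):
    """Lay out subject variable numbers on 80-column card images."""
    if not variable_numbers:
        return [" " * 80]  # INVAR = 0: a single blank card
    cards = []
    for start in range(0, len(variable_numbers), per_card):
        chunk = variable_numbers[start:start + per_card]
        body = "".join(str(v).rjust(width) for v in chunk)
        cards.append((" " * 10 + body).ljust(80))
    return cards


# INVAR = 20: variables 1-20 chosen purely for illustration.
cards = second_output_cards(list(range(1, 21)))
```

With twenty variable numbers the sketch produces two card images: the first holds fifteen fields, and the second holds the remaining five in its first five fields.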

If INVAR = 0, then, as a default, the subject variables
specified by DEC-TEST are numbers 6, 8, 62, 77, 72, 95, 90,
100, 99, and 101. These variables will be punched out if
IO(4) = 2 or 3 and/or they will be printed (with verbal
identifications as opposed to numerical identifications)
if IO(4) = 1 or 3.

If INVAR ≠ 0, then all INVAR variables will be punched
out if IO(4) = 2 or 3. If IO(4) = 1 or 3 and if INVAR > 10,
only the first 10 subject variables specified will be printed
in the rosters of subject scores. If IO(4) = 1 or 3 and
0 < INVAR <= 10, all subject variables specified will be
printed.

(9) Third Output Card (required)

DEC-TEST allows the user to specify two criterion
scores for calculating Livingston's Reliability Coefficient
for each of the eight possible types of reliability
analyses available in DEC-TEST.

See Output No. 34 ("Summary of Reliability Analyses")
in Section IV for a discussion of Livingston's Reliability
Coefficient. This is the only output that provides
Livingston's Coefficients; these coefficients are not
contained in Output Nos. 25-32 ("Reliability Analyses").

Livingston's Coefficients will be printed (and,
therefore, two criterion scores are required) only if
IO(28) = 1 and the corresponding Reliability Analysis
(controlled by IO(20) to IO(27)) is requested. If IO(28) = 0
and/or if IO(20) to IO(27) are all equal to 0, then the
Third Output Card may be left blank and will be ignored.

Note that if DEC-TEST expects a criterion score and the
user leaves such a score blank, then the criterion score is
set to "0", which is an acceptable value for calculating
Livingston's Reliability Coefficient.

(10) Object-Time Format Card(s) for Heading (required)

DEC-TEST expects IOTH (see First Input Card) Object-Time
Format Cards for a Heading that is placed at the top
of each page of printed output. In order to have a heading
printed, the FORTRAN rules for object-time format statements
must be followed. At a minimum, this card should contain
the characters (1H ). This card (these cards) should
define only one line of print.

(11) Object - Time Format Card(s) for Subject Data Input 
(required) 

DEC-TEST expects lOTD (see First Input Card) Object- 
Time Format Caid(s) for describing a record of Subject 
Data Input. All of the data for a particular subject 
should be contained in one record, and FORTRAN rules for 
object-time format statements should be followed. 

The principles that should be followed in- specif yiiig 
this object-time format are: 

(a) In general, the data in a student record should 
be ordered as follows: subject identification, responses 
for first item, responses for second item, ..., responses 
for Kth item, additional student variable. 

(b) If INC = 0, then no subject identification is 
expected. If IXTRA = 0, then no additional student variable 
is expected. (See First Input Card.) 

(c) Student identification information is specified 
in A-format. The number of A-format characters must equal 
INC. 

(d) Item responses and the additional student variable 
(if present) are specified in F-format. 

(e) Note that the format must take into account 
responses for all alternatives to all items. For example, 
if one has a 25-item test and each item has four 
alternatives, then (25)(4) = 100 responses are expected for each 
student; thus, the object-time format must specify 100 
responses, not 25, and each student record must contain 100 
responses. 
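
Principles (a) through (e) can be sketched as a small helper that assembles a format string. This is only an illustration: the function name, the field width, and the number of decimals are assumptions, not values prescribed by DEC-TEST, and a single A-field as wide as INC may need to be split into several shorter A-fields on some machines.

```python
def subject_format(inc, alternatives, ixtra, width=4, decimals=2):
    """Assemble a FORTRAN-style format for one subject record.

    inc          -- number of identification characters (0 = none)
    alternatives -- list with the number of alternatives for each item
    ixtra        -- 1 if an additional student variable follows, else 0
    width, decimals -- hypothetical F-field layout chosen for this sketch
    """
    n_responses = sum(alternatives)  # one response per alternative per item
    fields = []
    if inc > 0:
        fields.append(f"A{inc}")                          # principle (c)
    fields.append(f"{n_responses}F{width}.{decimals}")    # principles (d), (e)
    if ixtra == 1:
        fields.append(f"F{width}.{decimals}")             # additional variable
    return "(" + ",".join(fields) + ")"


# 25 items with four alternatives each: 100 responses, not 25.
print(subject_format(inc=6, alternatives=[4] * 25, ixtra=1))
```

The point of the sketch is the repeat count: it is the sum of the alternatives over items, so a test with mixed item types still yields one response field per alternative.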



(12-16) Cards Providing Item Information (subset required) 



These final control cards provide DEC-TEST with 
information concerning the characteristics of each item 
and the kinds of analyses for each item desired by the 
user. This information can be conveyed in two ways: 
(a) a "long description" that provides this information 
for each item individually or (b) a "short description" that 
provides similar information for all items at once. 

Long Description. When the user wants to specify 
information for each item individually, then the user 
should let ITCDS = 1 in the Item Keys Definition Card and 
leave the remaining parameters on this card blank (if 
specified, these parameters are ignored anyway); leave 
blank the Object-Time Format Card for Answer Key and Item 
Type Key (if a format is specified, the format is ignored 
anyway); not include an Item Type Key or an Answer Key 
(no cards for these keys, not even blank ones); and provide 
one Item Card for each of the K items. Each item card 
allows the user to specify for each item: 

(a) a four character alphanumeric identification of 
the item that is printed on item analysis output. (A 
sequential item number is always printed on such output, 
whether or not the user specifies an alphanumeric item 
identification.) 

(b) the correct answer for the item expressed as an 
integer between "1" and the number of alternatives for 
the item; i.e., 1 <= IT(i,2) <= IT(i,3). The correct 
answer must be specified for each item. 

(c) the number of alternatives for the item expressed 
as an integer between "2" and "5"; i.e., 2 <= IT(i,3) <= 5. 
The number of alternatives must be specified for each item. 

(d) the item weight, which may be any real number. 
If the item weight is set to "0" or left blank, then, in 
effect, the item under consideration is excluded in the 
calculation of student scores. 

(e) the split-halves parameter, which is used for 
reliability analyses only. If this parameter is "1" or 
"2", then the item is placed in the first or second "half" 
of the test, respectively; if this parameter is "0", 
then the item is eliminated from consideration in reliability 
analyses. Note that the number of items in the first and 
second "halves" of the test need not be equal. 

(f) four parameters for controlling whether or not the 
user desires each of the five possible analyses (or outputs) 
provided by DEC-TEST for each item. Each of these analyses 
(or outputs) is described in Section IV. 

Note that DEC-TEST expects item cards on logical unit 
LUCD (see Second Input Card). 
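
The constraints on an Item Card can be illustrated with a short validity check. This is a sketch only: the class and field names are hypothetical, and the check covers just the rules stated in (a) through (e), not whatever validation DEC-TEST itself performs.

```python
from dataclasses import dataclass


@dataclass
class ItemCard:
    ident: str      # alphanumeric item identification, at most four characters
    correct: int    # IT(i,2): position of the correct answer
    n_alts: int     # IT(i,3): number of alternatives
    weight: float   # item weight; 0 excludes the item from student scores
    half: int       # split-halves parameter: 0, 1, or 2


def check_item_card(card):
    """Return a list of rule violations (empty if the card is acceptable)."""
    problems = []
    if len(card.ident) > 4:
        problems.append("identification longer than four characters")
    if not 2 <= card.n_alts <= 5:
        problems.append("number of alternatives must be between 2 and 5")
    if not 1 <= card.correct <= card.n_alts:
        problems.append("correct answer must lie between 1 and the number of alternatives")
    if card.half not in (0, 1, 2):
        problems.append("split-halves parameter must be 0, 1, or 2")
    return problems


print(check_item_card(ItemCard("Q07", correct=3, n_alts=4, weight=1.0, half=1)))
print(check_item_card(ItemCard("Q08", correct=5, n_alts=4, weight=1.0, half=2)))
```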

Short Description. The short description is "short" 
in that the K item cards are not required; however, cards 
12-15 are required. 

The Object-Time Format Card for Answer Key and Item 
Type Key must be specified according to the FORTRAN rules 
for object-time formats. The format should be specified 
using I-format, not F-format. 

ITCDS in the Item Keys Definition Card should be set 
to "2", and the remaining parameters in the card must 
be specified. ITKEY allows the user to specify which 
comes first in the user's control cards: the Item Type 
Key (ITKEY = 2) or the Answer Key (ITKEY = 1). 
ITSCO, ITDEC, ITELI, and ITCLA are analogous to IT(i,5), 
IT(i,6), IT(i,7), IT(i,8) in the Item Cards. For example, 
if ITSCO = 1, then Output No. 8 (Unweighted Student Scores 
on Item i; see Section IV) is printed for each and every 
item; the same result would occur if the long description 
were used and IT(i,5) = 1 for all K items. 

The Item Type Key and the Answer Key are placed in 
the control card deck in the order specified by ITKEY. 
They both must conform to the Object-Time Format Card for 
Answer Key and Item Type Key, and both keys must be on 
logical unit LUCD (see Second Input Card). 

Note that the short description does not allow the 
user to define alphanumeric item identifications, item 
weights, or split-halves. When the short description is 
used, the only item identification is a sequential item 
number (i.e., the first item in a student record is labelled 
item number 1, the second item is labelled item number 2, 
..., the last item is labelled item number K); all item 
weights are set to "1.0"; and an odd-even split-halves is 
automatically provided. 

IV. Output 



DEC-TEST provides different kinds of output. 
In this section we provide a brief description of these 
kinds of output in the order in which they are received 
by the user. This section is intended to be used along 
with sample output. 

Note that Output Nos. 1-32 are printed on logical 
unit LUPT, Output Nos. 33 and 34 are printed on logical 
unit LUPT2, and Output Nos. 35-44 are punched on logical 
unit LUPC. Although we use the words "print" and "punch", 
the user controls the medium on which output is written. 
DEC-TEST has been programmed so that outputs on LUPT and 
LUPT2 occupy no more than 132 characters in width, and 
outputs on LUPC are 80-column card images. 

Each of the different kinds of punched output is 
identified by a header card(s), and output printed on 
LUPT is identified by a page number in the upper right-hand 
corner. For all output, a sequence of *'s replacing 
the value of some score or variable indicates that the 
score or variable cannot be calculated for the particular 
set of data being analyzed. For example, if a variable 
has a standard deviation of zero, then the correlation 
between this variable and some other variable cannot be 
calculated. 
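
The asterisk convention can be illustrated with a minimal sketch. The function names and the field width are hypothetical; the sketch only shows why a zero standard deviation leaves a correlation undefined and how such a value might be rendered.

```python
import math


def pearson(x, y):
    """Pearson correlation, or None when either variable has zero variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return None  # division by zero: the coefficient cannot be calculated
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return sxy / math.sqrt(sxx * syy)


def show(value, width=8):
    """Render a value, substituting *'s when it could not be calculated."""
    return "*" * width if value is None else f"{value:{width}.4f}"


print(show(pearson([1, 2, 3], [2, 4, 6])))
print(show(pearson([1, 1, 1], [1, 2, 3])))
```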

Also, note that all standard deviations and variances 
reported are biased estimates. 

In the following pages we provide an output number 
(used only for the purposes of this manual), an output 
title, and a description for each of the different kinds 
of output generated by DEC-TEST. 

(1) Title Page -- always printed 

(2) Control Cards -- always printed 

This is an interpreted pseudo-listing of the control 
cards input to DEC-TEST. Note in particular that the 
Item Type Key and Answer Key (used when ITCDS = 2) are not 
printed in the manner in which they are submitted to 
DEC-TEST. However, the information provided by these keys 
is contained in the lines that begin "DATA FOR ITEM NO"; 
also, when ITCDS = 2, the other values on these lines of 
printout are obtained from the Item Keys Definition Card 
or assigned automatically by DEC-TEST. 

(3) Student Data: Input 



This output is printed in its entirety if = 
If IO(1) = 2, then only the first three and last three 
student records are printed. 

The total number of responses (in the sense of 
alternatives) for each student is: 

$$\sum_{i=1}^{K} n_i ,$$

where $n_i$ is the number of alternatives for item i. 
Thus, for example, if there are 25 items each with four 
alternatives, then the 100 responses made by each student 
will be printed on nine lines (12 responses on each of the 
first eight lines, and 4 responses on the last line). If 
IXTRA = 1, then the additional student variable will be 
the last value printed for each student. 
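
The arithmetic in this example can be checked with a few lines. The sketch assumes 12 responses per printed line, as in the example above; the function names are illustrative only.

```python
import math


def total_responses(alternatives):
    """Sum of n_i over items: one response per alternative."""
    return sum(alternatives)


def printed_lines(n_responses, per_line=12):
    """Number of lines needed and the count of responses on the last line."""
    lines = math.ceil(n_responses / per_line)
    on_last_line = n_responses - (lines - 1) * per_line
    return lines, on_last_line


n = total_responses([4] * 25)
print(n, printed_lines(n))
```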

The subject numbers reported are sequential subject 
numbers reflecting the order of the student records in the 
input data. 



(4) Messages 

This output is self explanatory. It is printed only 
if one or more students skip all items or all but one item. 

(5) Student Data : Observed Probabilities 

This output is printed in its entirety if IO(2) = 1 or 3. 
If IO(2) = 2 or 4, then only the first three and last three 
student records are printed. 

If the Messages output has occurred, then certain 
subjects have been eliminated and, therefore, do not appear 
here. In this case, the number of students for whom 
probabilities are reported will be less than N, which is, 
technically, the number of student records input to DEC-TEST. 
Since Output No. 5 constitutes the primary data matrix used 
to derive all future outputs, subsequent use of N in most 
outputs (exceptions are clear from the context of the 
output) refers to the reduced number of students. Note also 
that, if the Messages output occurs, the subject numbers 
indicated in Output No. 5 usually will not correspond 
exactly with those indicated in Output No. 3 (Student Data: 
Input). 



The probabilities reported have been normalized, if 
a validity check occurred and normalization was requested. 
If MSD = 1, and if a student skipped an item, then XMS 
appears in place of an observed probability for each of the 
alternatives for the item for that student. If MSD = 0 and 
a student skipped item i, then $1/n_i$ appears as the observed 
probability for each of the $n_i$ alternatives for item i 
for that student. 

The listing of the observed probabilities corresponds 
with the listing of Student Data Input (Output No. 3), 
except for the fact that Student Data Input also contains 
the additional student variable if IXTRA = 1. 

(6) Roster of Student Raw Scores -- printed if IO(4) = 1 or 3 

For information concerning the actual variables printed 
out see Section III — Second Output Card(s). 

If a sort was requested, this output will reflect the 
result of the sorting procedure. Subject numbers to the left 
of the parentheses are the "sorted subject numbers"; i.e., 
numbers that indicate students' positions in the sorted 
roster. Subject numbers within parentheses are the 
sequential subject numbers from Student Data: Observed 
Probabilities (Output No. 5). 

(7) Roster of Student z-Scores -- printed if IO(4) = 1 or 3 

This output corresponds with Output No. 6. The scores 
reported are non-normalized z-scores with a mean of 0.0 
and a standard deviation of 1.0. 

(8) Unweighted Student Scores on Item i -- printed if 
ITSCO = 1 or IT(i,5) = 1 

See Output No. 6 for a discussion of subject numbers. 
See Section I for a discussion of scores reported. Note 
that, if MSD = 1, then the sample size reported includes 
only those students who did not skip item i. This output 
provides the raw data for Output Nos. 9-12. The symbol 
"*" next to an alternative in Output Nos. 8-12 indicates 
the correct answer. Also, for these outputs, the first 
alternative is labelled "A", the second "B", etc.; 
the item numbers reported are sequential item numbers; 
and beside the sequential item numbers, in parentheses, 
are the user-defined item identifications, if any. 



(9) Item Analysis for Item No. i Using Decision-Theoretic 
Scoring with Observed Probabilities -- printed if 
ITDEC = 1 or 3, or if IT(i,6) = 1 or 3 

This output is divided into four parts and uses the 
data given in Output No. 8. The four parts are: 

(a) Item Analysis Table. This is a three-dimensional 
frequency distribution in which each entry represents the 
number of students in a particular group (lower, middle, 
upper, or total; see Section III, Item Analysis 
Definition Cards) who responded with observed probabilities 
in a particular interval, for a particular alternative. 

Let $N'_1$, $N'_2$, $N'_3$, and $N'_4$ be the number of subjects in 
the lower, middle, upper, and total groups, respectively. 
Now, the means and standard deviations below this table 
are based upon the observed probabilities not classified 
into intervals. Thus, for example, the mean observed 
probability for students in group g on alternative j is: 

$$\bar{P}_{.(g)j} = \frac{1}{N'_g} \sum_{h=1}^{N'_g} P_{h(g)j} , \quad \text{where}$$

j = 1, 2, ..., $n_i$ alternatives for item i, 

g = 1, 2, 3, 4 groups, and 

h = 1, 2, ..., $N'_g$ subjects in group g. 

(b) Item Analysis Indices. The formulas for each of 
the nine indices are given below. Here (and elsewhere in 
this manual) j = * refers to the correct answer. 

(1) Arithmetic mean item score using observed 
probabilities: 

$$\mathrm{AMP} = \bar{P}_{.(4)*} , \qquad 0 \le \mathrm{AMP} \le 1 .$$

(2) Difference discrimination index for arithmetic 
means using observed probabilities: 

$$\mathrm{DDAP} = \bar{P}_{.(3)*} - \bar{P}_{.(1)*} , \qquad -1 \le \mathrm{DDAP} \le 1 .$$

(3) Geometric mean item score using observed 
probabilities: 

$$\mathrm{GMP} = \Bigl[ \prod_{h=1}^{N'_4} P_{h(4)*} \Bigr]^{1/N'_4} , \qquad 0 \le \mathrm{GMP} \le 1 .$$

(4) Difference discrimination index for geometric 
means using observed probabilities: 

$$\mathrm{DDGP} = \Bigl[ \prod_{h=1}^{N'_3} P_{h(3)*} \Bigr]^{1/N'_3} - \Bigl[ \prod_{h=1}^{N'_1} P_{h(1)*} \Bigr]^{1/N'_1} ,$$

where -1 <= DDGP <= 1. 

(5) Correlational discrimination index using observed 
probabilities (CDP): the Pearson product moment 
correlation coefficient between observed 
probabilities associated with the correct answer and 
scores on the criterion variable. 

(-1 <= CDP <= 1.) 

(6) Average information (AVI): the arithmetic mean of 
the perceived information for each student 
reported in Output No. 8. ($0 \le \mathrm{AVI} \le \log_2 n_i$.) 

(7) Arithmetic mean item score using observed log 
scores: 

$$\mathrm{AML} = \frac{1}{N'_4} \sum_{h=1}^{N'_4} \bigl( A_i \log P_{h(4)*} + B_i \bigr) = A_i \log \mathrm{GMP} + B_i = \frac{1}{N'_4} \sum_{h=1}^{N'_4} L_{h(4)*} ,$$

where, in general, 

$L_{h(g)*}$ = log score for student h in group g 
for the correct answer to item i, and 

$$A_i \log C_i + B_i \le \mathrm{AML} \le B_i .$$

(8) Difference discrimination index for arithmetic 
means using observed log scores: 

$$\mathrm{DDAL} = \bigl[ \bar{L}_{.(3)*} - \bar{L}_{.(1)*} \bigr] / ( -A_i \log C_i ) ,$$

where the denominator is the range of the log 
scoring function, and, therefore, 
-1 <= DDAL <= 1. 

(9) Correlational discrimination index using 
observed log scores (CDL): the Pearson product 
moment correlation coefficient between observed 
log scores associated with the correct answer 
and scores on the criterion variable. 

(-1 <= CDL <= 1.) 

(c) Pearson Product Moment Correlation Coefficients. 
These are correlations between the probabilities 
associated with all possible pairs of alternatives for 
each of the four groups. 

(d) Frequency Distribution of Perceived Information. 
Note that the limits of the class intervals for informa- 
tion vary depending upon the number of alternatives that 
the item has. 
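
As a rough illustration of indices (1) through (4) and (7), the following sketch computes them for a handful of made-up probabilities. The data, the constants A and B, and the use of base-10 logarithms are assumptions made for this example only; the sketch also assumes that no probability falls low enough for the log scoring function to be truncated.

```python
import math


def arithmetic_mean(values):
    return sum(values) / len(values)


def geometric_mean(values):
    # A single zero probability forces the geometric mean to zero.
    if any(v == 0.0 for v in values):
        return 0.0
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Hypothetical probabilities assigned to the correct answer, by group.
lower = [0.2, 0.4, 0.5]
middle = [0.5, 0.6]
upper = [0.8, 0.9, 1.0]
total = lower + middle + upper

A, B = 1.0, 1.0  # hypothetical constants of the log scoring function

amp = arithmetic_mean(total)                                   # index (1)
ddap = arithmetic_mean(upper) - arithmetic_mean(lower)         # index (2)
gmp = geometric_mean(total)                                    # index (3)
ddgp = geometric_mean(upper) - geometric_mean(lower)           # index (4)
aml = arithmetic_mean([A * math.log10(p) + B for p in total])  # index (7)

print(amp, ddap, gmp, ddgp, aml)
```

Under these assumptions the mean log score equals A log GMP + B, which is why the geometric mean is the natural companion of log scoring.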



(10) Item Analysis for Item No. i Using Decision-Theoretic 
Scoring with Adjusted Probabilities -- printed if 
ITDEC = 2 or 3, or if IT(i,6) = 2 or 3 

The format for this output is identical to that for 
Output No. 9. For an explanation of Output No. 10, 
merely make the following replacements in the explanation 
for Output No. 9: "observed" becomes "adjusted"; 
"perceived" becomes "actual"; the observed probability "P" 
becomes the corresponding adjusted probability; and the 
observed log score "L" becomes the corresponding adjusted 
log score. 

(11) Item Analysis for Item No. i Using Elimination Scoring 
-- printed if ITELI = 1 or IT(i,7) = 1 

This output is divided into three parts and uses the 
data in Output No. 8. The three parts are: 

(a) Item Analysis Table. The table is self-explanatory, 
with the exception of average item score, "AVE IT 
SC", for group g, which is: 

$$\mathrm{ASE}_g = \frac{1}{N'_g} \sum_{h=1}^{N'_g} E_{h(g)} , \quad \text{where}$$

$E_{h(g)}$ = elimination score for student h in 
group g (for item i), 

g = 1, 2, 3, 4 groups (lower, middle, upper, 
and total, respectively), and 

$N'_g$ = number of students in group g. 

(b) Item Analysis Indices. The formulas for each of 
the four indices are given below. 

(1) Average item score: 

$$\mathrm{ASE} = \mathrm{ASE}_4 , \qquad -1 \le \mathrm{ASE} \le 1 .$$

(2) Standard deviation of item scores: 

$$\mathrm{SDE} = \sqrt{ \frac{1}{N'_4} \Bigl[ \sum_{h=1}^{N'_4} E_{h(4)}^2 - \frac{ \bigl( \sum_{h=1}^{N'_4} E_{h(4)} \bigr)^2 }{N'_4} \Bigr] } , \qquad \mathrm{SDE} \ge 0 .$$

(3) Difference discrimination index: 

$$\mathrm{DDIE} = ( \mathrm{ASE}_3 - \mathrm{ASE}_1 ) / 2.0 , \qquad -1 \le \mathrm{DDIE} \le 1 .$$

The denominator, 2.0, is the range of the 
possible elimination scores (unweighted) 
for an item. 

(4) Correlational discrimination index (CDIE): the 
Pearson product moment correlation coefficient 
between elimination scores and criterion 
variable scores for item i. (-1 <= CDIE <= 1.) 

(c) Frequency Distribution of All Possible 
Combinations of Eliminated Alternatives. This part of Output 
No. 11 is self-explanatory. 



(12) Item Analysis for Item No. i Using Classical Scoring 
-- printed if ITCLA = 1 or IT(i,8) = 1 



This output is divided into two parts and uses the 
data in Output No. 8. The two parts are: 

(a) Item Analysis Table. This is, in essence, a 
standard item analysis table. The presence of fractional 
values can be explained through an example. Suppose that, 
for a three-alternative item, a student's observed 
probabilities are, in order, 0.40, 0.40, and 0.20. The student's 
classical score will be 0.50, 0.50, or 0.00 depending upon 
whether the first, second, or third alternative is the 
correct answer. Thus, 0.50 is added to the frequency counts 
for the first and second alternatives in order to provide 
our "best guess" concerning the number of students who 
would choose each alternative if students were forced to 
pick one and only one alternative. 
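
The fractional counts in this example can be reproduced with a short sketch. It assumes that a student's single "choice" is shared equally among the alternatives tied for the highest observed probability, which is what the example implies; the function name and the exact-equality test for ties are simplifications made here, not a description of DEC-TEST's internals.

```python
def classical_shares(probs):
    """Split one 'choice' equally among the alternatives tied for the maximum."""
    top = max(probs)
    winners = [j for j, p in enumerate(probs) if p == top]
    share = 1.0 / len(winners)
    return [share if j in winners else 0.0 for j in range(len(probs))]


# The three-alternative example from the text.
shares = classical_shares([0.40, 0.40, 0.20])
print(shares)

# The classical score is the share that falls on the correct alternative.
correct = 0  # first alternative, zero-based index
print(shares[correct])
```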

(b) Item Analysis Indices. Let us express the average 
item score for students in group g as: 

$$\mathrm{ASC}_g = \frac{1}{N'_g} \sum_{h=1}^{N'_g} C_{h(g)} , \quad \text{where}$$

$C_{h(g)}$ = classical score for student h in group g, 

g = 1, 2, 3, 4 groups (lower, middle, upper, 
and total, respectively), and 

$N'_g$ = number of students in group g. 

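The formulas for the four indices are as follows: 

(1) Average item score: 

$$\mathrm{ASC} = \mathrm{ASC}_4 , \qquad 0 \le \mathrm{ASC} \le 1 .$$

(2) Standard deviation of item scores: 

$$\mathrm{SDC} = \sqrt{ \frac{1}{N'_4} \Bigl[ \sum_{h=1}^{N'_4} C_{h(4)}^2 - \frac{ \bigl( \sum_{h=1}^{N'_4} C_{h(4)} \bigr)^2 }{N'_4} \Bigr] } , \qquad \mathrm{SDC} \ge 0 .$$

(3) Difference discrimination index: 

$$\mathrm{DDIC} = \mathrm{ASC}_3 - \mathrm{ASC}_1 , \qquad -1 \le \mathrm{DDIC} \le 1 .$$

(4) Correlational discrimination index (CDIC): the 
Pearson product moment correlation coefficient 
between classical scores and criterion scores 
for item i. (-1 <= CDIC <= 1.) 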



(13) Item Analysis Indices for Decision-Theoretic Scoring 
Using Observed Probabilities and Log Scores -- printed if 
TISTTT = 1 or 3 

Indices are printed for item i only if ITDEC = 1 or 3 , 
or if IT(i,6) = 1 or 3. See Output No. 9 for a description 
of the indices reported. 

(14) Item Analysis Indices for Decision-Theoretic Scoring 
Using Adjusted Probabilities and Log Scores -- printed if 
rkTT = 5 or 3 

Indices are printed for item i only if ITDEC = 2 or 3 , 
or if IT(i,6) = 2 or 3. See Output No. 10 for a description 
of the indices reported. 

(15) Item Analysis Indices for Elimination Scoring -- 
printed if IO(10) 

Indices are printed for item i only if ITELI = 1 or 
IT(i,7) = 1. See Output No. 11 for a description of the 
indices reported. 

(16) Item Analysis Indices for Classical Scoring -- 
printed if IO(11) = 1 

Indices are printed for item i only if ITCLA = 1 or 
IT(i,8) = 1. See Output No. 12 for a description of the 
indices reported. 

(17-24) Rosters of Students by Weighted Item Scores 

Output No.   Data Used                 Printed if 

   17        Observed Probabilities    IO(12) = 1 or 3 
   18        Adjusted Probabilities    IO(13) = 1 or 3 
   19        Observed Log Scores       IO(14) = 1 or 3 
   20        Adjusted Log Scores       IO(15) = 1 or 3 
   21        Elimination Scores        IO(16) = 1 or 3 
   22        Classical Scores          IO(17) = 1 or 3 
   23        Perceived Information     IO(18) = 1 or 3 
   24        Actual Information        IO(19) = 1 or 3 

Note that these rosters report weighted item scores. 
Formulas for calculating the unweighted components of 
these scores are found in Section I. The unweighted scores 
are designated there by $P_{hi*}$ (observed and adjusted 
probabilities), $L_{hi*}$ (observed and adjusted log scores), 
$E_{hi}$ (elimination scores), $C_{hi}$ (classical scores), and 
$I_{hi}$ (perceived and actual information), respectively. 
Their weighted counterparts are $w_i P_{hi*}$, $w_i L_{hi*}$, 
and so on. Clearly, if $w_i$ = 1.0 for i = 1, 2, ..., K, 
then unweighted scores and weighted scores are identical. 

Ten scores are printed on each line. The first score 
reported is for item number 1, the second score for item 
number 2, etc. If MSD = 1 and an item was skipped by a 
student, then 999.999 is printed to indicate missing data. 
See Output No. 6 for a discussion of subject numbers. 

(25-32) Reliability Analyses 

Output No.   Data Used                 Printed if 

   25        Observed Probabilities    IO(20) = 1 
   26        Adjusted Probabilities    IO(21) = 1 
   27        Observed Log Scores       IO(22) = 1 
   28        Adjusted Log Scores       IO(23) = 1 
   29        Elimination Scores        IO(24) = 1 
   30        Classical Scores          IO(25) = 1 
   31        Perceived Information     IO(26) = 1 
   32        Actual Information        IO(27) = 1 

Each of the above outputs provides a Hoyt Analysis of 
Variance Reliability Analysis as well as a Split-Halves 
Analysis. In addition, for Output Nos. 25 and 26, DEC-TEST 
provides a Split-Halves Analysis where the student score 
equals the geometric mean of the probabilities associated 
with correct answers. 

If MSD = 1 and any item scores are missing (identified 
as 999.999 in Output Nos. 17-24), then, for Output Nos. 
25-32, all missing item scores are transformed to the item 
scores that would result if $P_{hij} = 1/n_i$. In effect, this 
transformation has the same effect on Output Nos. 25-32 
as setting MSD = 0 in the First Input Card. 

In the following paragraphs, we explain the reliability 
analysis output, in general, and provide selected formulas. 
For these purposes let us define 

$X_{hi}$ = unweighted generic item score for 
student h on item i, where 

h = 1, 2, ..., N and 
i = 1, 2, ..., K. 

One caveat is in order. If the split-halves parameter for 
item i, IT(i,4), equals "0", then, for reliability analyses, 
item i is skipped. In this case, the actual number of items 
used for reliability analyses will be less than K (the 
number of items input to DEC-TEST); hence, we use K' as 
the total number of items under consideration here. 

Hoyt Analysis of Variance. Many discussions of Hoyt's 
(1941) procedure for calculating reliability are available. 
See, for example, Guilford (1954) which also provides an 
excellent treatment of most of the coefficients and scores 
reported in Output Nos. 25-32. 

The general element of the matrix that forms the raw 
data for calculating Hoyt's Reliability Coefficient is 
$w_i X_{hi}$. The coefficient itself, which is identical to 
Cronbach's (1951) Coefficient Alpha, is: 

$$r_{tt}(\text{Hoyt}) = 1 - \frac{\text{Mean Square (Remainder)}}{\text{Mean Square (Examinees)}}$$
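
The equivalence between Hoyt's coefficient and Coefficient Alpha can be checked numerically. The following is a minimal sketch on a small made-up matrix of weighted item scores (rows are examinees, columns are items); the data and function names are hypothetical, and a standard two-way analysis of variance without replication is assumed.

```python
def hoyt_reliability(scores):
    """1 - MS(remainder) / MS(examinees) from a two-way ANOVA."""
    n, k = len(scores), len(scores[0])
    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[i] for row in scores) / n for i in range(k)]
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_examinees = k * sum((m - grand) ** 2 for m in row_means)
    ss_items = n * sum((m - grand) ** 2 for m in col_means)
    ms_examinees = ss_examinees / (n - 1)
    ms_remainder = (ss_total - ss_examinees - ss_items) / ((n - 1) * (k - 1))
    return 1.0 - ms_remainder / ms_examinees


def coefficient_alpha(scores):
    """Coefficient Alpha from item variances and total-score variance."""
    k = len(scores[0])

    def variance(values):
        m = sum(values) / len(values)
        return sum((v - m) ** 2 for v in values) / len(values)

    item_vars = [variance([row[i] for row in scores]) for i in range(k)]
    total_var = variance([sum(row) for row in scores])
    return k / (k - 1) * (1.0 - sum(item_vars) / total_var)


data = [
    [1, 1, 1, 0],
    [1, 0, 1, 0],
    [0, 0, 1, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
]
print(hoyt_reliability(data), coefficient_alpha(data))
```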



Now, to explain the way in which standard errors of 
measurement are reported in Reliability Analyses, let 

$$X_{h.} = \sum_{i=1}^{K'} w_i X_{hi}$$ = weighted total score for student h, 

$$\bar{X}_{h.} = X_{h.} \Big/ \sum_{i=1}^{K'} w_i$$ = weighted mean score for student h, 

$SD(X_{h.})$ = standard deviation of weighted student 
total scores, and 

$SD(\bar{X}_{h.})$ = standard deviation of weighted student 
mean scores. 

In general, the standard error of measurement is: 

$$\mathrm{SEM} = s \sqrt{1 - r_{tt}} .$$

When "s" is replaced by $SD(X_{h.})$ in the above equation, we 
get the standard error of measurement for student 
total scores, which is printed to the left of the 
parentheses in Output Nos. 25-32. When "s" is replaced 
by $SD(\bar{X}_{h.})$ we get the standard error of measurement for 
student mean scores, which is printed within parentheses. 
All standard errors of measurement are reported by DEC-TEST 
in a similar manner. 
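
The relation between the two reported standard errors can be sketched as follows. The numbers are made up; the sketch uses the usual formula, SEM = s * sqrt(1 - reliability), and relies on the fact that a weighted mean score is the weighted total score divided by the sum of the item weights, so both the standard deviation and the standard error scale by that sum.

```python
import math


def sem(sd, reliability):
    """Standard error of measurement for scores with standard deviation sd."""
    return sd * math.sqrt(1.0 - reliability)


sd_total = 2.0        # hypothetical SD of weighted total scores
sum_of_weights = 4.0  # hypothetical sum of item weights
r_tt = 0.75           # hypothetical reliability coefficient

sem_total = sem(sd_total, r_tt)                  # left of the parentheses
sem_mean = sem(sd_total / sum_of_weights, r_tt)  # within the parentheses
print(f"{sem_total:.3f} ({sem_mean:.3f})")
```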

Split-Halves. If IT(i,4) = 1, then item i goes in the 
first "half"; if IT(i,4) = 2, then item i goes in the second 
"half." In the table, "STUDENT SCORE = SUM OVER ITEMS" is 
analogous to $X_{h.}$ and "STUDENT SCORE = MEAN OVER ITEMS" is 
analogous to $\bar{X}_{h.}$. Thus, $SD(X_{h.})$ and $SD(\bar{X}_{h.})$ are 
referred to as the "STANDARD DEVIATION" for the "TOTAL" 
test when "STUDENT SCORE = SUM OVER ITEMS", and the 
"STANDARD DEVIATION" for the "TOTAL" test when "STUDENT 
SCORE = MEAN OVER ITEMS", respectively. 

Now, let 

r = correlation between two halves, 

$s_d$ = standard deviation of differences using 
total student scores (placed to left of 
parentheses in output), 

$s_{\bar{d}}$ = standard deviation of differences using 
student mean scores (placed within 
parentheses in output), 

a = proportion of total item weights in first 
"half", and 

b = 1 - a = proportion of total item weights in 
second "half". 

Using this notation, Horst's Reliability for Parts of Unequal 
Length can be expressed as: 

$$r_{tt}(\text{Horst}) = \frac{ r \sqrt{ r^2 + 4ab(1 - r^2) } - r^2 }{ 2ab(1 - r^2) } .$$



Rulon's, Flanagan's, or Guttman's Reliability is: 

$$r_{tt}(\text{Rulon}) = 1 - \bigl\{ s_d^2 / [SD(X_{h.})]^2 \bigr\} .$$
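
The two split-halves coefficients can be sketched as below. The sketch uses the commonly cited form of Horst's formula for parts of unequal length, stated here as an assumption about what DEC-TEST computes; the function names and the numerical inputs are hypothetical. When the halves carry equal weight (a = b = 0.5), Horst's formula reduces to the Spearman-Brown formula, which provides a convenient check.

```python
import math


def horst(r, a):
    """Horst's reliability for parts of unequal length (requires |r| < 1)."""
    b = 1.0 - a
    numerator = r * math.sqrt(r * r + 4.0 * a * b * (1.0 - r * r)) - r * r
    denominator = 2.0 * a * b * (1.0 - r * r)
    return numerator / denominator


def rulon(sd_differences, sd_total):
    """Rulon's reliability from the SD of half-score differences."""
    return 1.0 - sd_differences ** 2 / sd_total ** 2


def spearman_brown(r):
    return 2.0 * r / (1.0 + r)


print(horst(0.6, 0.5), spearman_brown(0.6))  # equal halves agree
print(horst(0.6, 0.3))                       # unequal halves
print(rulon(1.0, 2.0))
```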

Split-Halves where Student Score Equals Geometric 
Mean of Probabilities Associated with Correct Answers. 
Since the subject score is a (geometric) mean, all 
entries in the output that depend upon "STUDENT SCORE = 
SUM OVER ITEMS" are filled with *'s. The geometric 
mean score for student h on the total test is: 

$$\Bigl\{ \prod_{i=1}^{K'} X_{hi}^{\,w_i} \Bigr\}^{ 1 / \sum_{i=1}^{K'} w_i } ,$$

where $X_{hi}$ is replaced by the observed probability $P_{hi*}$ 
or by the corresponding adjusted probability, depending upon 
whether one is considering Output No. 25 or 26, 
respectively. Similar formulas can be constructed to calculate 
a student's score for the first and second "half" tests. 
Rulon's Reliability Coefficient is meaningless for this 
kind of data, and, therefore, all results depending upon 
it are replaced by *'s. The user should be aware that 
the validity of using geometric means in a split-halves 
analysis is questionable, at best. 

(33) Individual Subject Scores -- printed if IO(3) = 1 

DEC-TEST provides 102 scores for each individual 
subject, which are identified, in general, as VAR(1) 
to VAR(102). The most important of these scores are 
treated in Section I; all scores are considered in 
Section V and formulas for such scores are provided. 
Note that this output is printed on logical unit LUPT2, 
and page numbers are not provided. The calculations that 
produce this output are performed just prior to the 
printing of Output No. 6. 

(34) Summary of Reliability Analyses (with addition of 
Livingston's Coefficients) 

This output is printed if IO(28) = 1 and at least 
one of the parameters IO(20) to IO(27) equals 1. 
Furthermore, a summary is provided only if the corresponding 
"complete" Reliability Analysis was requested. 

The "USER DEFINED CUT-1" and "USER DEFINED CUT-2" 
values are the criterion scores supplied by the user 
in the Third Output Card. These values are used to 
calculate Livingston's (1972) Reliability Coefficient 
defined as: 

$$r_{tt}(\text{Liv}) = \frac{ r_{tt} V(X) + (\bar{X} - C)^2 }{ V(X) + (\bar{X} - C)^2 } , \quad \text{where}$$

$r_{tt}$ = any one of the reliability coefficients 
reported in Output Nos. 25-32, 

C = criterion score for Livingston's 
Reliability Coefficient, 

$\bar{X}$ = mean, over subjects, of scores, and 

V(X) = variance, over subjects, of scores. 
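
A direct transcription of this definition shows how the coefficient behaves as the criterion score moves away from the mean. The inputs below are made up, and the function name is illustrative.

```python
def livingston(r_tt, variance, mean, cut):
    """Livingston's coefficient for criterion score `cut`."""
    squared_distance = (mean - cut) ** 2
    return (r_tt * variance + squared_distance) / (variance + squared_distance)


r_tt, variance, mean = 0.8, 0.04, 0.6  # hypothetical summary statistics

for cut in (0.6, 0.7, 0.8):
    print(cut, round(livingston(r_tt, variance, mean, cut), 4))
```

When the criterion score equals the mean, the squared distance vanishes and the coefficient reduces to the ordinary reliability; the farther the criterion score lies from the mean, the closer the coefficient moves toward 1.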

Now, reliability is, in general, unaffected by 
whether the underlying student score is $X_{h.}$ or $\bar{X}_{h.}$. 
(See discussion of Reliability Analyses, Output Nos. 
25-32, for the notation used here.) However, for this 
output we use $\bar{X}_{h.}$ as the raw score for calculating 
$\bar{X}$ and for defining C ("CUT-1" or "CUT-2" in output); thus, 
we use: 

$$\bar{X} = \bar{X}_{..} = \frac{1}{N} \sum_{h=1}^{N} \bar{X}_{h.} , \quad \text{and}$$

$$V(X) = [ SD(\bar{X}_{h.}) ]^2 .$$

This choice results in $\bar{X}$ having clearly defined limits. 
Specifically, 

when the scores used are       the limits of $\bar{X}$ are 

observed probabilities         $0 \le \bar{X} \le 1$ 

adjusted probabilities         $0 \le \bar{X} \le 1$ 

observed log scores 
or                             $A_i \log C_i + B_i \le \bar{X} \le B_i$ 
adjusted log scores 

elimination scores             $-1 \le \bar{X} \le 1$ 

classical scores               $0 \le \bar{X} \le 1$ 

perceived information 
or                             $0 \le \bar{X} \le \sum_{i=1}^{K'} w_i \log_2 n_i \Big/ \sum_{i=1}^{K'} w_i$ 
actual information 


Technically, the limits provided are only an 
approximation for adjusted log scores for a test in which not 
all items are of the same item type (i.e., not all items 
have the same number of alternatives). 

Furthermore, for both observed and adjusted log scores, 
the limits provided are exactly correct only if $A_i$, $B_i$, and 
$C_i$ are identical for each item. If this is not 
true, then the lower limit is: 

$$\Bigl[ \sum_{i=1}^{K'} w_i ( A_i \log C_i + B_i ) \Bigr] \Big/ \sum_{i=1}^{K'} w_i$$

and the upper limit is: 

$$\sum_{i=1}^{K'} w_i B_i \Big/ \sum_{i=1}^{K'} w_i$$


The experience of the author indicates that, even though 
the limits provided in the body of the text may not be 
exactly correct for a given test, these limits are almost 
always a good enough approximation for practical use. 

Technically, the upper limit provided is only an 
approximation for actual information for a test in which 
not all items are of the same item type; nevertheless, 
even in this case, the author's experience indicates that 
the limit provided in the body of the text is a good 
enough approximation for practical use. 

Now, referring to the formula for $r_{tt}(\text{Liv})$, note that 

$r_{tt}(\text{Liv}) = r_{tt}$ if $\bar{X} = C$; 

i.e., Livingston alleges that his coefficient gives the 
reliability that would result if C were the mean of the 
test. Thus, when choosing potential values of C for the 
Third Output Card, the user should choose values within 
the limits reported above for the various interpretations 
of $\bar{X}$. 

In short, this output reports 

$r_{tt} = r_{tt}(\text{Liv})$ when C = $\bar{X}$, 

$r_{tt}(\text{Liv})$ when C = "CUT-1", and 

$r_{tt}(\text{Liv})$ when C = "CUT-2" 

for each of the different reliability coefficients ($r_{tt}$) 
reported in Output Nos. 25-32. 

(35) Observed Probabilities - Punched -- punched if IO(2) = 
3, 4, or 5 

Note that, if IO(2) >= 3, then all observed probabilities 
for all subjects are punched. There is no provision 
for punching out observed probabilities for the first three 
and the last three students only. Thus, this output is 
analogous to Output No. 5 when IO(2) = 1 or 3. 

The format for card output is as follows: 

Columns     Description 

1-5         Run identification, RUN(5) 
6           Blank 
7-30        Student identification 
31-33       Sequential card number for student = SCN 
34-35       Blank 
36-40       Observed probability: (1)(SCN) 
41-45       Observed probability: (2)(SCN) 
...         ... 
76-80       Observed probability: (9)(SCN) 

Note that each card contains a maximum of nine 
observed probabilities (punched using format F5.3). 
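
The column layout can be made concrete with a sketch that builds 80-column card images. The function name and sample data are hypothetical, and Python's fixed-point formatting stands in for FORTRAN F5.3 (a FORTRAN compiler may render a leading zero differently).

```python
def probability_cards(run_id, student_id, probs, per_card=9):
    """Build 80-column card images holding up to nine probabilities each."""
    cards = []
    for start in range(0, len(probs), per_card):
        scn = start // per_card + 1           # sequential card number
        card = f"{run_id:<5.5}"               # columns 1-5
        card += " "                           # column 6
        card += f"{student_id:<24.24}"        # columns 7-30
        card += f"{scn:3d}"                   # columns 31-33
        card += "  "                          # columns 34-35
        card += "".join(f"{p:5.3f}" for p in probs[start:start + per_card])
        cards.append(card.ljust(80))          # pad a short last card
    return cards


cards = probability_cards("RUN01", "STUDENT A", [0.25] * 10)
for card in cards:
    print(repr(card))
```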

(36) Roster of Student Raw Scores - Punched -- punched if 
IO(4) = 2 or 3 

For information concerning variables punched out, see 
Section III, Second Output Card(s). Output No. 36 is 
similar to Output No. 6. 

The format for card output is as follows: 

Columns     Description 

1-5         Run identification, RUN(5) 
6           Blank 
7-30        Student identification 
31-34       Sorted student number 
35          ( 
36-39       Sequential student number 
40          ) 
41-43       Sequential card number for student = SCN 
45-53       Score number: (1)(SCN) 
54-62       Score number: (2)(SCN) 
63-71       Score number: (3)(SCN) 
72-80       Score number: (4)(SCN) 

Note that each card contains a maximum of four 
scores. The scores are punched in the order indicated 
by the sequence of variables in the Second Output Card(s). 
Scores are punched using format F9.3. 

(37-44) Rosters of Students by Weighted Item Scores - 
Punched 

DEC-TEST provides eight such punched rosters. 



Output No. Data Used Punched If 

37 Observed Probabilities 10(12) = 2 or 3 

38 Adjusted Probabilities 10(13) = 2 or 3 

39 Observed Log Scores I0(1»+) = 2 or 3 
•♦0 Adjusted Log Scores 10(15) = 2 or 3 
•♦1 Elimination Scores 10(16) = 2 or 3 
•♦2 Classical Scores 10(17) = 2 or 3 
•♦3 Perceived Information 10(18) = 2 or 3 
•♦«♦ Actual Information 10(19) = 2 or 3 

These outputs are analogous to Output Nos. i7-2U, 
respectively. 

The format for card output is as follows : 

Columns    Description

 1-5       Run identification, RUN(5)
 6         Blank
 7-30      Student identification
31-34      Sorted student number
35         (
36-39      Sequential student number
40         )
41-43      Sequential card number for student = SCN
44-52      Item number: (1)(SCN)
53-61      Item number: (2)(SCN)
62-70      Item number: (3)(SCN)
71-79      Item number: (4)(SCN)

Note that each card contains a maximum of four 
scores (punched using format F9.3). 
V. Individual Subject Scores 

DEC-TEST provides 102 Individual Subject Scores 
(for each student), labelled VAR(1) to VAR(102) in 
Table A-4. Many of these scores have been introduced 
in Section I, and formulas for all scores are provided 
later in this section. Note that all scores except VAR(1), 
VAR(101), and VAR(102) take item weights into account. 
Table A-5 provides the Individual Subject Scores Output 
(Output No. 33) for a hypothetical student using the 
illustrative data introduced in Section I. 

For the purposes of discussion, we will divide the 
Individual Subject Scores Output into five parts, and 
discuss each part separately. 

VAR(2) to VAR(16): Variables Relating to Reference Lines 

The Ideal and Realism Lines have been discussed in 
Section I. The extent to which a student is unrealistic 
is indicated by VAR(8) as well as by 

     VAR(5) - VAR(6) = 1.0 - VAR(6) . 

However, note that: 

     VAR(8) >= 0.0, whereas 

     1.0 - VAR(6) = 0.0 if student is completely realistic, 

     1.0 - VAR(6) > 0.0 if student is over-confident, and 

     1.0 - VAR(6) < 0.0 if student is under-confident. 

Therefore, VAR(8) is a measure of the magnitude of unrealistic 
student performance, whereas 1.0 - VAR(6) is a measure 
of both the magnitude and direction of unrealistic student 
performance. For the illustrative data, VAR(8) = 11.7499 
degrees and 1.0 - VAR(6) = 1.0 - 0.65563 = 0.34437. 
Thus, this hypothetical student is over-confident. 
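
The relationship between the Realism Line slope, the angle VAR(8), and the direction of unrealistic performance can be checked numerically. The Python sketch below follows the formulas for VAR(8), VAR(11), and VAR(14) listed later in this section as they are read here; the function name and return layout are hypothetical and are not part of DEC-TEST.

```python
import math


def realism_angle(slope: float):
    """Return (decimal degrees, whole degrees, minutes, direction)
    for a Realism Line with the given slope, i.e. VAR(6)."""
    # Angle between the Ideal Line (slope 1.0) and the Realism Line
    decimal_deg = abs(math.degrees(math.atan((1.0 - slope) / (1.0 + slope))))
    whole_deg = int(decimal_deg)                         # VAR(11)
    minutes = int((decimal_deg - whole_deg) * 60 + 0.5)  # VAR(14)
    difference = 1.0 - slope  # carries direction as well as magnitude
    if difference > 0.0:
        direction = "over-confident"
    elif difference < 0.0:
        direction = "under-confident"
    else:
        direction = "completely realistic"
    return decimal_deg, whole_deg, minutes, direction


# Illustrative data: VAR(6) = 0.65563
print(realism_angle(0.65563))
```

For the illustrative slope this yields an angle of about 11.75 degrees (11 degrees, 45 minutes) and the label "over-confident", in agreement with the values quoted above.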

The Base Line is the estimated "realism" line if the 
student had always assigned a probability of 1.0 to a single 
alternative for each item and 0.0 to the other alternatives. 
In a sense, therefore, the Base Line is the "realism" line 
for classical scoring as opposed to decision-theoretic 
scoring. 
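
As a rough check on the Base Line, its slope and intercept can be recomputed from the classical score. The sketch below assumes the reading of the formulas for VAR(4) and VAR(7) given later in this section. The sums of item weights (10.0) and of weights times numbers of alternatives (25.0) are not stated in the text; they are inferred here from the column totals of the illustrative output in Table A-5 and should be treated as assumptions. The function name is hypothetical.

```python
def base_line(classical_score: float, sum_w: float, sum_wn: float):
    """Return (intercept, slope) of the Base Line.

    classical_score -- VAR(99), weighted number of items correct
    sum_w           -- sum of item weights (assumed value in the example)
    sum_wn          -- sum of weight * number of alternatives (assumed)
    """
    chance = sum_w ** 2 / sum_wn                       # common correction term
    slope = (classical_score - chance) / (sum_w - chance)   # VAR(7)
    intercept = sum_w * (1.0 - slope) / sum_wn               # VAR(4)
    return intercept, slope


# Illustrative data: VAR(99) = 6.5, with assumed sums 10.0 and 25.0
intercept, slope = base_line(6.5, 10.0, 25.0)
print(round(intercept, 5), round(slope, 5))
```

Under these assumptions the slope comes out near 0.41667, which matches the Base Line slope printed for the illustrative data.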
TABLE A-4

Variable Numbers: Scores for an Individual Subject

               IDEAL LINE    REALISM LINE    BASE LINE
INTERCEPT      VAR(2)        VAR(3)          VAR(4)
SLOPE          VAR(5)        VAR(6)          VAR(7)

                                              DEC. DEG.    DEG.       MIN.
ANGLE BETWEEN IDEAL LINE AND REALISM LINE     VAR(8)       VAR(11)    VAR(14)
ANGLE BETWEEN IDEAL LINE AND BASE LINE        VAR(9)       VAR(12)    VAR(15)
ANGLE BETWEEN REALISM LINE AND BASE LINE      VAR(10)      VAR(13)    VAR(16)

PROBABILITY      NO. TIMES    NO. TIMES    PROPORTION
INTERVAL         USED         CORRECT      CORRECT
0.0<=P< 0.1      VAR(17)      VAR(27)      VAR(37)
0.1<=P< 0.2      VAR(18)      VAR(28)      VAR(38)
0.2<=P< 0.3      VAR(19)      VAR(29)      VAR(39)
0.3<=P< 0.4      VAR(20)      VAR(30)      VAR(40)
0.4<=P< 0.5      VAR(21)      VAR(31)      VAR(41)
0.5<=P< 0.6      VAR(22)      VAR(32)      VAR(42)
0.6<=P< 0.7      VAR(23)      VAR(33)      VAR(43)
0.7<=P< 0.8      VAR(24)      VAR(34)      VAR(44)
0.8<=P< 0.9      VAR(25)      VAR(35)      VAR(45)
0.9<=P<=1.0      VAR(26)      VAR(36)      VAR(46)

                                    IDEAL LN    REAL. LN    BASE LN
AVERAGE S.S. OF DEVIATIONS FROM     VAR(47)     VAR(48)     VAR(49)

                          OVER ALL ITEMS           PER ITEM
                          ACTUAL     PERCEIVED     ACTUAL     PERCEIVED
ENTROPY (UNCERTAINTY)     VAR(50)    VAR(53)       VAR(56)    VAR(59)
INFORMATION               VAR(51)    VAR(54)       VAR(57)    VAR(60)
MAX. POSSIBLE INFO.       VAR(52)    VAR(55)       VAR(58)    VAR(61)

COEFFICIENT OF BIAS = VAR(62)

                                LG SC      LG SC      AR MN      GM MN
                                OVER       PER        PROB.      PROB.
                                ITEMS      ITEM       SCORE      SCORE
POSSIBLE IMPROVEMENT FROM:
   BETTER USE OF INFO.          VAR(63)    VAR(72)    VAR(81)    VAR(90)
   MORE INFORMATION             VAR(64)    VAR(73)    VAR(82)    VAR(91)
SCORE RESULTING FROM:
   BETTER USE OF INFO.          VAR(65)    VAR(74)    VAR(83)    VAR(92)
   MORE INFORMATION             VAR(66)    VAR(75)    VAR(84)    VAR(93)
TOTAL POSSIBLE IMPROVEMENT      VAR(67)    VAR(76)    VAR(85)    VAR(94)
OBSERVED SCORE                  VAR(68)    VAR(77)    VAR(86)    VAR(95)
HIGHEST POSSIBLE SCORE          VAR(69)    VAR(78)    VAR(87)    VAR(96)
SCORE STUDENT EXPECTS           VAR(70)    VAR(79)    VAR(88)    VAR(97)
SCORE FOR NO KNOWLEDGE          VAR(71)    VAR(80)    VAR(89)    VAR(98)

CLASSICAL SCORE = VAR(99)              ELIMINATION SCORE = VAR(100)
NUMBER OF VALIDITY CHECKS = VAR(101)   NUMBER OF ITEMS SCORED = VAR(102)
ADDITIONAL STUDENT VARIABLE = VAR(1)
TABLE A-5

Illustrative Data: Scores for Individual Subject

               IDEAL LINE    REALISM LINE    BASE LINE
INTERCEPT      0.0           0.13775         0.23333
SLOPE          1.00000       0.65563         0.41667

                                              DEC. DEG.    DEG.    MIN.
ANGLE BETWEEN IDEAL LINE AND REALISM LINE     11.7499      11.     45.
ANGLE BETWEEN IDEAL LINE AND BASE LINE        22.3801      22.     23.
ANGLE BETWEEN REALISM LINE AND BASE LINE      10.6302      10.     38.

PROBABILITY      NO. TIMES    NO. TIMES    PROPORTION
INTERVAL         USED         CORRECT      CORRECT
0.0<=P< 0.1      4.0          1.0          0.2500
0.1<=P< 0.2      0.0          0.0          0.0
0.2<=P< 0.3      3.0          1.0          0.3333
0.3<=P< 0.4      4.0          0.0          0.0
0.4<=P< 0.5      6.0          3.0          0.5000
0.5<=P< 0.6      3.0          1.0          0.3333
0.6<=P< 0.7      1.0          1.0          1.0000
0.7<=P< 0.8      0.0          0.0          0.0
0.8<=P< 0.9      2.0          2.0          1.0000
0.9<=P<=1.0      2.0          1.0          0.5000

                                    IDEAL LN    REAL. LN    BASE LN
AVERAGE S.S. OF DEVIATIONS FROM     0.1552      0.1127      0.0899

                          OVER ALL ITEMS           PER ITEM
                          ACTUAL     PERCEIVED     ACTUAL     PERCEIVED
ENTROPY (UNCERTAINTY)     11.6678     9.5571       1.1668     0.9557
INFORMATION                1.2570     3.3677       0.1257     0.3368
MAX. POSSIBLE INFO.       12.9248    12.9248       1.2925     1.2925

COEFFICIENT OF BIAS = 16.330

                                LG SC       LG SC      AR MN      GM MN
                                OVER        PER        PROB.      PROB.
                                ITEMS       ITEM       SCORE      SCORE
POSSIBLE IMPROVEMENT FROM:
   BETTER USE OF INFO.            14.790      1.479    -0.041      0.029
   MORE INFORMATION              178.178     17.818     0.521      0.560
SCORE RESULTING FROM:
   BETTER USE OF INFO.           821.822     82.182     0.479      0.440
   MORE INFORMATION              985.210     98.521     1.041      0.971
TOTAL POSSIBLE IMPROVEMENT       192.968     19.297     0.480      0.589
OBSERVED SCORE                   807.032     80.703     0.520      0.411
HIGHEST POSSIBLE SCORE          1000.000    100.000     1.000      1.000
SCORE STUDENT EXPECTS            856.151     85.615     0.533      0.516
SCORE FOR NO KNOWLEDGE           805.462     80.546     0.417      0.408

CLASSICAL SCORE = 6.500            ELIMINATION SCORE = 2.000
NUMBER OF VALIDITY CHECKS = 3.     NUMBER OF ITEMS SCORED = 8.
ADDITIONAL STUDENT VARIABLE = *******
VAR(17) to VAR(49): Distribution of Observed Probabilities 

VAR(17) to VAR(46) constitute the distribution of 
observed probabilities collapsed into ten class intervals 
of length 0.10. These variables, therefore, provide the 
grouped data version of the kind of information presented 
in Table A-3 for the illustrative data. 
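
The grouping into ten class intervals can be sketched as follows. This Python fragment is an illustration, not the DEC-TEST code: the function name and the input layout (one tuple per alternative, holding the assigned probability, whether that alternative is the correct answer, and the item weight) are assumptions. It also assumes, in line with the illustrative output, that the proportion correct is reported as 0.0 for an interval that was never used.

```python
def grouped_distribution(responses):
    """Collapse observed probabilities into ten intervals of length 0.10.

    responses -- iterable of (probability, is_correct, item_weight)
    Returns three lists of length 10: weighted times used,
    weighted times correct, and proportion correct.
    """
    used = [0.0] * 10
    correct = [0.0] * 10
    for p, is_correct, weight in responses:
        # Intervals are closed on the left; the top one (0.9 <= P <= 1.0)
        # is closed on both sides, so P = 1.0 is folded into index 9.
        # Values lying exactly on a boundary may need rounding in practice.
        k = min(int(p * 10), 9)
        used[k] += weight
        if is_correct:
            correct[k] += weight
    # Assumption: unused intervals report a proportion of 0.0
    proportion = [c / u if u else 0.0 for c, u in zip(correct, used)]
    return used, correct, proportion
```

Each returned list corresponds to one column of the grouped output: times used, times correct, and proportion correct.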

VAR(47) to VAR(49) provide the average sums of 
squares of the grouped data points about each of the 
three reference lines. VAR(47) and VAR(49) are of 
questionable utility; however, VAR(48) provides an 
indication of the extent to which the least squares Realism 
Line is a good fit for the observed probability data 
points. 

VAR(50) to VAR(62): Information Theoretic Measure of 
Student Performance 

This section of the output contains three parts: 
(a) information measures over all items (i.e., for the test); 
(b) information measures per item (i.e., for the average 
over items); and (c) the Coefficient of Bias. 

Part (a) can be graphically displayed as the Information 
Square in Figure A-2, which can be interpreted in terms of 
the Arabian proverb: 

     He who knows and knows that he knows, 
     He is wise, follow him. 

     He who knows and knows not that he knows, 
     He is asleep, awaken him. 

     He who knows not and knows not that he knows not, 
     He is a fool, shun him. 

     He who knows not and knows that he knows not, 
     He is a child, teach him. 

Since each variable in part (b) is a simple function of 
a corresponding variable in part (a), part (b) can also 
be graphically displayed in terms of the Information 
Square. 

The Coefficient of Bias, VAR(62), provides another 
indication of the extent to which a student is unrealistic. 
Note that: 
FIGURE A-2 

Illustrative Data: Information Square 

[Figure: Information Square. VAR(52) = 12.9248 and VAR(55) = 12.9248 
are marked at the top of the square; the only region label that 
survives is "Child".] 

Note.--The top horizontal line represents maximum 
possible actual and perceived information; the bottom 
horizontal line represents no information. The dashed 
line connects perceived and actual information. 
Numerical values reported are for the illustrative 
data. Be careful to graph "information," not "entropy." 


     -100.0 <= VAR(62) <= 100.0 

     VAR(62) = 0.0 if student is completely realistic, 

     VAR(62) > 0.0 if student is over-confident, and 

     VAR(62) < 0.0 if student is under-confident. 

For the illustrative data, VAR(62) = 16.330; therefore, 
the student is over-confident, which is consistent with 
the conclusion we reached when we observed that 
1.0 - VAR(6) = 0.34437 > 0.0. Another indication of 
over-confidence, on the part of our hypothetical student, 
is provided by the fact that the slope of the dashed line 
in the Information Square is greater than 0.0. 
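
The Coefficient of Bias can be reproduced from the information measures. The sketch below uses the formula for VAR(62) given later in this section, that is, perceived information minus actual information, divided by maximum possible information, times 100. The function name is hypothetical. The example feeds in the per-item values printed for the illustrative data; because every per-item value is the corresponding over-all-items value divided by the same sum of weights, the ratio should be unchanged.

```python
def coefficient_of_bias(perceived_info: float,
                        actual_info: float,
                        max_info: float) -> float:
    """Return the Coefficient of Bias on a -100.0 to 100.0 scale.

    Positive values indicate over-confidence (perceived information
    exceeds actual information); negative values, under-confidence.
    """
    return (perceived_info - actual_info) / max_info * 100.0


# Per-item values for the illustrative data
bias = coefficient_of_bias(perceived_info=0.3368,
                           actual_info=0.1257,
                           max_info=1.2925)
print(round(bias, 2))
```

The result differs from the printed 16.330 only in the third decimal place, which is what one would expect from the four-place rounding of the inputs.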

The user should be cautious in the interpretation of 
information and entropy measures in that the scale for 
these measures is non-linear (specifically, logarithmic — 
base 2); hence, it is easy to fall into the error of over- 
and/or under-interpreting differences in magnitudes for 
these measures.¹ 

VAR(63) to VAR(98): Primary Test Scores 

This section of the output provides four sets of nine 
scores each, involving: (a) log scores over all items; 
(b) log scores per item (average over all items); 
(c) arithmetic mean probability scores; and (d) geometric 
mean probability scores. In general, parts (c) and (d) 
involve taking probabilities associated with the correct 
answers and calculating arithmetic and geometric weighted 
means, respectively. Note that all scores reported take 
item weights into account. 
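
The two kinds of mean probability score can be illustrated with a short sketch. This is a simplified illustration under stated assumptions rather than the DEC-TEST routine: the function name is hypothetical, the inputs are the probabilities a student assigned to the correct answers together with the item weights, and base-10 logarithms are used for the geometric mean.

```python
import math


def mean_probability_scores(correct_probs, weights):
    """Return (arithmetic mean, geometric mean) of the probabilities
    assigned to the correct answers, weighted by item weight.

    A probability of 0.0 on a correct answer has no logarithm, so
    this sketch requires every probability to be greater than zero.
    """
    total_weight = sum(weights)
    arithmetic = sum(w * p for p, w in zip(correct_probs, weights)) / total_weight
    log_mean = sum(w * math.log10(p)
                   for p, w in zip(correct_probs, weights)) / total_weight
    geometric = 10.0 ** log_mean
    return arithmetic, geometric


# Two equally weighted items answered with 0.50 and 0.25 on the correct choice
print(mean_probability_scores([0.5, 0.25], [1.0, 1.0]))
```

The geometric mean is never larger than the arithmetic mean and drops sharply when any correct answer receives a very small probability, which is one way to see why it behaves like the log scores rather than like a linear score.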

Parts (a), (b), and (d) provide essentially the same 
information using different measurement scales. Perhaps 
the most interesting scores reported are those that reflect 
a partitioning of total possible improvement (i.e., increase) 
in test score into improvement if the student makes better 
use of his or her information (i.e., if the student is more 
realistic) and improvement if the student had more 
information. Thus, in effect, the user can provide the student 
with quite detailed information concerning what the student 
might do to improve his or her score, as well as the 
potential effect of such action. Figure A-3 provides a student 
profile for selected geometric mean probability scores, for 
the illustrative data introduced in Section I. All nine 
geometric mean probability scores reported in the output 
are directly or indirectly represented in Figure A-3. 
Similar profiles could be constructed for the log scores 
in parts (a) and (b) of this section of the output. 

¹Another consideration is that, when not all items 
have the same number of alternatives, the maximum possible 
amount of actual information is not equal to the maximum 
possible amount of perceived information (see Section I). 
DEC-TEST handles this discrepancy by transforming actual 
information to the scale of perceived information (see 
formula for VAR(50)). 

Part (c) provides arithmetic mean probability scores. 
These scores are provided for comparative research purposes, 
only. Such scores are not appropriate for decision-theoretic 
testing, since they are based upon a linear scoring system. 
One glaring indication of this lack of appropriateness is 
that "POSSIBLE IMPROVEMENT FROM BETTER USE OF INFORMATION" 
is almost invariably a negative score, indicating that 
students would almost always get lower (arithmetic mean 
probability) scores if they were more realistic.² 

VAR(99) to VAR(102), VAR(1): Secondary Test Scores 

The formulas for these scores are either self- 
explanatory or they provide a reference to an explanation. 

²On rare occasions, "POSSIBLE IMPROVEMENT FROM BETTER 
USE OF INFORMATION" for parts (a), (b), and (d) may have a 
small negative value. Such negative values should be 
interpreted as 0.0, since they are, for the most part, a result 
of rounding errors. 
Formulas for Individual Subject Scores 

The following is a list of the formulas for the 102 
Individual Subject Scores reported by DEC-TEST. There 
are, of course, a number of algebraically equivalent 
expressions for each of the equations listed here. For 
the most part, the equations provided are the ones 
actually used in programming DEC-TEST; however, these 
particular algebraic expressions may not always provide 
the most intuitively appealing definition of the variables. 
Thus, the user may wish to re-structure the algebraic 
expression for certain equations. 

The formulas provided are listed in a sequential 
manner, according to the variable numbers; i.e., VAR(1), 
VAR(2), ..., VAR(102). However, the user should note 
that a particular VAR may be a function of a subsequently 
defined VAR; for example, VAR(64) is a function of VAR(67). 
This slight inconsistency is merely a result of the 
particular numbering scheme used to identify variables. 

For the most part, the notational scheme used in the 
following formulas has already been introduced in 
Section I. The user should, however, note the following 
additions and minor modifications: 

(a) the student subscript identifier, h, is dropped, 
since all formulas provide scores for one subject; 

(b) the limits for subscripts i (i = 1, 2, ..., K) 
and j (j = 1, 2, ..., n_i) are not specified, since they 
remain constant; 

(c) i' (i' = 1, 2, ..., K) is used as an additional 
item subscript; 

(d) ℓ is used as a general purpose subscript, which 
is defined and/or given appropriate limits each time 
it is used; 

(e) "INT" means "integer value"; 

(f) "ABS" means "absolute value"; 

(g) "EXP" means "exponential"; 

(h) "ATAN" means "arc-tangent expressed in radians"; and 

(i) "*" is used as a multiplication operator as well 
as the indicator for correct answer. 
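VAR(1) = Score on Additional Student Variable 

VAR(2) = 0.0 

$\mathrm{VAR}(3) = \sum_i w_i\,[1.0 - \mathrm{VAR}(6)] \Big/ \sum_i w_i n_i$ 

$\mathrm{VAR}(4) = \sum_i w_i\,[1.0 - \mathrm{VAR}(7)] \Big/ \sum_i w_i n_i$ 

VAR(5) = 1.0 

$\mathrm{VAR}(6) = \dfrac{\sum_i w_i P_{i*} - \left[(\sum_i w_i)^2 / \sum_i w_i n_i\right]}{\sum_i \left(w_i \sum_j P_{ij}^2\right) - \left[(\sum_i w_i)^2 / \sum_i w_i n_i\right]}$ 

$\mathrm{VAR}(7) = \dfrac{\mathrm{VAR}(99) - \left[(\sum_i w_i)^2 / \sum_i w_i n_i\right]}{\sum_i w_i - \left[(\sum_i w_i)^2 / \sum_i w_i n_i\right]}$ 

$\mathrm{VAR}(8) = \mathrm{ABS}\left\{\dfrac{180}{\pi}\,\mathrm{ATAN}\left[\dfrac{1.0 - \mathrm{VAR}(6)}{1.0 + \mathrm{VAR}(6)}\right]\right\}$ 

$\mathrm{VAR}(9) = \mathrm{ABS}\left\{\dfrac{180}{\pi}\,\mathrm{ATAN}\left[\dfrac{1.0 - \mathrm{VAR}(7)}{1.0 + \mathrm{VAR}(7)}\right]\right\}$ 

$\mathrm{VAR}(10) = \mathrm{ABS}\left\{\dfrac{180}{\pi}\,\mathrm{ATAN}\left[\dfrac{\mathrm{VAR}(6) - \mathrm{VAR}(7)}{1.0 + \mathrm{VAR}(6) * \mathrm{VAR}(7)}\right]\right\}$ 

VAR(11) = INT[VAR(8)] 

VAR(12) = INT[VAR(9)] 

VAR(13) = INT[VAR(10)] 

VAR(14) = INT{[VAR(8) - VAR(11)]*60 + 0.5} 

VAR(15) = INT{[VAR(9) - VAR(12)]*60 + 0.5} 

VAR(16) = INT{[VAR(10) - VAR(13)]*60 + 0.5} 

VAR(17) to VAR(26) = weighted number of times observed 
                     probability in given interval was 
                     used by student 

VAR(27) to VAR(36) = weighted number of times 
                     probabilities in given interval were 
                     associated with correct answer 

$\mathrm{VAR}(36+\ell) = \mathrm{VAR}(26+\ell) / \mathrm{VAR}(16+\ell), \quad (\ell = 1, 2, \ldots, 10)$ 

$\mathrm{VAR}(47) = \sum_{\ell=1}^{10} \mathrm{VAR}(16+\ell)\,\{[\mathrm{VAR}(2)+\mathrm{VAR}(5)] * [(\ell/10) - 0.05] - \mathrm{VAR}(36+\ell)\}^2 \Big/ \sum_i w_i n_i$ 

$\mathrm{VAR}(48) = \sum_{\ell=1}^{10} \mathrm{VAR}(16+\ell)\,\{[\mathrm{VAR}(3)+\mathrm{VAR}(6)] * [(\ell/10) - 0.05] - \mathrm{VAR}(36+\ell)\}^2 \Big/ \sum_i w_i n_i$ 

$\mathrm{VAR}(49) = \sum_{\ell=1}^{10} \mathrm{VAR}(16+\ell)\,\{[\mathrm{VAR}(4)+\mathrm{VAR}(7)] * [(\ell/10) - 0.05] - \mathrm{VAR}(36+\ell)\}^2 \Big/ \sum_i w_i n_i$ 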

VAR(50) = 



[E(w. I ?..log2P. .)]*[!: w.log (n.)] 
i j i 

I w^{(n^o^+ej^)[log2(n^a^ + Dj^) - logjCn^)]} 



where o, = VAR(3) 
n 

and e, = VAR(6) 
O h 
VAR(51) = VAR(52) - VAR(50) 

$\mathrm{VAR}(52) = \sum_i w_i \log_2(n_i)$ 

$\mathrm{VAR}(53) = -\sum_i \left[w_i \sum_j P_{ij} \log_2(P_{ij})\right]$ 

VAR(54) = VAR(55) - VAR(53) 

VAR(55) = VAR(52) 

$\mathrm{VAR}(55+\ell) = \mathrm{VAR}(49+\ell) \Big/ \sum_i w_i, \quad (\ell = 1, 2, \ldots, 6)$ 

VAR(62) = {[VAR(54) - VAR(51)] / VAR(55)}*100.0 

VAR(63) = VAR(65) - VAR(68) 

VAR(64) = VAR(67) - VAR(63) 

VAR(65) = E {w. CA^log(P.*) + B^]} 
i 

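VAR(66) = VAR(68) + VAR(64) 

VAR(67) = VAR(69) - VAR(68) 

$\mathrm{VAR}(68) = \sum_i \{w_i\,[A_i \log(P_{i*}) + B_i]\}$ 

$\mathrm{VAR}(69) = \sum_i w_i B_i$ 

$\mathrm{VAR}(70) = \sum_i w_i \left\{\sum_j P_{ij}\,[A_i \log(P_{ij}) + B_i]\right\}$ 

$\mathrm{VAR}(71) = \sum_i w_i\,[A_i \log(1.0/n_i) + B_i]$ 

$\mathrm{VAR}(71+\ell) = \mathrm{VAR}(62+\ell) \Big/ \sum_i w_i, \quad (\ell = 1, 2, \ldots, 9)$ 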
VAR(96) = 1.0 

$\mathrm{VAR}(97) = 10.0\ \mathrm{EXP}\left\{\left[\sum_i w_i \sum_j P_{ij} \log(P_{ij})\right] \Big/ \sum_{i'} w_{i'}\right\} = \prod_i \prod_j \left\{P_{ij}\ \mathrm{EXP}\left(w_i P_{ij} \Big/ \sum_{i'} w_{i'}\right)\right\}$ 

$\mathrm{VAR}(98) = \prod_i \left[n_i\ \mathrm{EXP}\left(-w_i \Big/ \sum_{i'} w_{i'}\right)\right]$ 

VAR(99) = estimated weighted number of items correct 
          if student forced to respond to each item 
          with one and only one choice of correct 
          answer. (See Section I.) 

VAR(100) = estimated weighted elimination score for 
           test. (See Section I.) 

VAR(101) = unweighted number of validity checks. 
           (See Section III, DCT.) 

VAR(102) = unweighted number of items scored. 
           If MSD = 0, VAR(102) = K; 
           If MSD = 1, VAR(102) <= K; 
           i.e., an item is not scored 
           if the input responses for all 
           alternatives equal "XMS", the 
           code for missing data. 


VI. Technical Data and Information 



Structure of DEC-TEST 

DEC-TEST consists of a MAINLINE program, 16 
subroutines, and a BLOCK DATA subprogram. Table A-6 lists 
selected technical characteristics of each program unit. 
The principal function of MAINLINE is the assignment of 
user-defined values for modifiable assignment statements 
and modifiable dimension statements. INPUT serves as the 
principal program unit (subroutine) for reading control 
cards and student data, as well as for branching to other 
subroutines. 

As indicated in Table A-6, DEC-TEST requires 124,294 
bytes of main storage if no overlay structure is used. 
If the overlay structure indicated in Table A-6 and 
Figure A-7 is used, then DEC-TEST requires 70,094 bytes of 
main storage; i.e., DEC-TEST requires the number of bytes 
necessary to store Segment 1 and Segment 5. Thus, the use 
of the overlay structure saves 124,294 - 70,094 = 54,200 
bytes. However, these figures do not include: (a) bytes 
required for user-defined matrices and vectors and 
(b) additional bytes (overhead) required by FORTRAN for 
execution of DEC-TEST. 
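
The storage figures can be reproduced from the byte counts per segment in Table A-6. The Python sketch below is an illustration only; the dictionary name and function names are hypothetical, and the Segment 1 entry is the sum of the byte counts listed for the program units assigned to that segment.

```python
# Bytes per overlay segment, taken from Table A-6.
# Segment 1 (root): MAINLINE, INPUT, PAGER, STDV, CORR, SEM,
# plus BLOCK DATA and "other" (25994 bytes combined).
SEGMENT_BYTES = {
    1: 760 + 22516 + 424 + 472 + 546 + 470 + 25994,
    2: 2500,    # COVER
    3: 2768,    # IOSDI
    4: 4938,    # SETUP
    5: 18912,   # SCORE
    6: 2352,    # UML
    7: 10292,   # IADCT
    8: 6100,    # IAELIM
    9: 4400,    # IACLAS
    10: 3528,   # SUMRY
    11: 8994,   # SXITEM
    12: 8328,   # RELIAB
}


def storage_without_overlay(segments):
    """All program units are resident at once."""
    return sum(segments.values())


def storage_with_overlay(segments, root=1):
    """Root segment plus the largest of the overlaid segments."""
    largest = max(size for seg, size in segments.items() if seg != root)
    return segments[root] + largest


print(storage_without_overlay(SEGMENT_BYTES))  # bytes with no overlay
print(storage_with_overlay(SEGMENT_BYTES))     # bytes with overlay
```

The overlay requirement is governed by Segment 5 (SCORE) because it is the largest segment that shares storage with the others.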

User-Modifications of DEC-TEST 

Figure A-4 provides a partial listing of the 
MAINLINE program for DEC-TEST. Both the modifiable 
dimension statements (MAI 7 to MAI 19) and the modifiable 
assignment statements (MAI 30 to MAI 33) can be altered 
by the user prior to compilation of DEC-TEST. Figure A-5 
provides a worksheet for making such changes and 
determining the total number of bytes required by DEC-TEST. 
Note that, in order to execute DEC-TEST, the user needs 
additional bytes (overhead) required by FORTRAN; for this 
purpose, in most cases, 10,000 bytes should be more than 
sufficient. 

In Figures A-4 and A-5 

     NDIM = maximum number of students for a run, 

     KDIM = maximum number of items for a run, and 

     IADIM = maximum number of responses 
             (alternatives) for a student. 
TABLE A-6 

Structure of DEC-TEST 

Program        Abbre-       Seg-      No.         No. of Source    No. of
Unitᵃ          viationᵇ     mentᶜ     Bytesᵈ      Statements       Cardsᵉ

MAINLINE       MAI           1            760          32             53
INPUT          INP           1          22516         382            524
PAGER          PAG           1            424          11             13
STDV           STD           1            472           8             10
CORR           COR           1            546           7             10
SEM            SEM           1            470           8             10
BLOCK DATA     BLK           1 }        25994           8             10
other                        1 }                        0              0
COVER          COV           2           2500          70             83
IOSDI          IOS           3           2768          58             65
SETUP          SET           4           4938         150            165
SCORE          SCO           5          18912         460            493
UML            UML           6           2352          63             75
IADCT          IAD           7          10292         258            295
IAELIM         IAE           8           6100         163            177
IACLAS         IAC           9           4400         111            123
SUMRY          SUM          10           3528          94            108
SXITEM         SXI          11           8994         257            277
RELIAB         REL          12           8328         244            268

Totals:                                124,294        2384           2769

ᵃAll "program units" are subroutines except for 
MAINLINE, BLOCK DATA, and "other." "Other" includes 
FORTRAN supplied subroutines, functions, etc. 
required by DEC-TEST. 

ᵇThese abbreviations are found in columns 72-74 of 
the source deck for DEC-TEST. Each card in the source 
deck is uniquely identified by the appropriate 
abbreviation followed by a sequential (within subroutine) card 
number in columns 76-80. 

ᶜThe segment numbers refer to the overlay structure 
for DEC-TEST. During program execution, if the user employs 
the overlay structure, then main storage contains Segment 1 
(root segment) and one of the Segments 2-12. 

ᵈThe number of bytes required by user-defined matrices 
and vectors is not included here. The number of bytes for 
"other" includes FORTRAN supplied subroutines, functions, 
etc. required by DEC-TEST. 

ᵉ"No. of cards" equals "no. of source statements" plus 
number of comment cards plus number of continuation cards. 
FIGURE A-5 

Worksheet for User-Defined Matrices, 
Vectors, and Assignment Statements 

NDIM = ____    KDIM = ____    IADIM = ____    LUCC = ____ 

No. of bytes for user-defined matrices and vectors 
     = NDIM[(4)(IADIM) + 212] + (164)(KDIM) + 182 
     = ____[(4)(____) + 212] + (164)(____) + 182 
     = ______ bytesᵃ 

Variable           User               No. of         No. of Bytes
Dimensions         Dimensions         Locations

X(NDIM,IADIM)      X(    ,    )       ______
Z(NDIM+2,20)       Z(    ,20)         ______
R(NDIM,20)         R(    ,20)         ______
RIT(KDIM+1)        RIT(    )          ______
T(KDIM,32)         T(    ,32)         ______
RS(KDIM)           RS(    )           ______
RT(KDIM)           RT(    )           ______
JT(KDIM)           JT(    )           ______
                   Subtotal-1         ______ x 4 =   ______

Y(NDIM,24)         Y(    ,24)         ______
IYSRT(NDIM)        IYSRT(    )        ______
IT(KDIM+1,9)       IT(    ,9)         ______
MS(KDIM)           MS(    )           ______
IDDM(NDIM)         IDDM(    )         ______
                   Subtotal-2         ______ x 2 =   ______

No. of bytes for user-defined matrices and 
vectors (Subtotal-1 + Subtotal-2)                    ______ bytesᵃ 

Number of bytes required 
by DEC-TEST using overlay (70094 bytes) 
or not using overlay (124294 bytes)                  ______ bytes 

Total                                                ______ bytes 

ᵃThese two values should be identical. 
These are the only three variables required to define 
matrix and vector dimensions. Note that each element of 
the first eight matrices (or vectors) is a real variable 
occupying four bytes of main storage, while each element 
of the last five matrices (or vectors) is an integer 
variable occupying two bytes of main storage. Thus, the 
first eight matrices (or vectors) are associated with 
DIMENSION statements, while the last five matrices (or 
vectors) are associated with INTEGER*2 statements. 
Also, note that the number of elements (or locations) 
for a matrix is the product of its dimensions. 

As indicated at the beginning of Section II, LUCC, 
the logical unit for reading (most of) the control 
cards, may be altered in the MAINLINE program. 

Figure A-6 provides an example of the worksheet in 
Figure A-5. For this example, the total number of bytes 
required to execute DEC-TEST, using the overlay, is about 
138184 + 10000 = 148184. 
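
The worksheet arithmetic can be checked with a short sketch. The Python below evaluates the closed-form expression from the worksheet and, separately, the matrix-by-matrix count it summarizes (eight real arrays at four bytes per element, five integer arrays at two bytes per element). The function names are hypothetical, and the example values NDIM = 59, KDIM = 50, and IADIM = 200 are those of the worked example in Figure A-6.

```python
def user_bytes_formula(ndim: int, kdim: int, iadim: int) -> int:
    """Closed-form expression from the worksheet."""
    return ndim * (4 * iadim + 212) + 164 * kdim + 182


def user_bytes_itemized(ndim: int, kdim: int, iadim: int) -> int:
    """Same quantity, counted matrix by matrix."""
    real_locations = (
        ndim * iadim         # X(NDIM,IADIM)
        + (ndim + 2) * 20    # Z(NDIM+2,20)
        + ndim * 20          # R(NDIM,20)
        + (kdim + 1)         # RIT(KDIM+1)
        + kdim * 32          # T(KDIM,32)
        + 3 * kdim           # RS(KDIM), RT(KDIM), JT(KDIM)
    )
    integer_locations = (
        ndim * 24            # Y(NDIM,24)
        + ndim               # IYSRT(NDIM)
        + (kdim + 1) * 9     # IT(KDIM+1,9)
        + kdim               # MS(KDIM)
        + ndim               # IDDM(NDIM)
    )
    return 4 * real_locations + 2 * integer_locations


user = user_bytes_formula(59, 50, 200)
assert user == user_bytes_itemized(59, 50, 200)  # the two values should agree
print(user, user + 70094)  # user-defined bytes, and total with overlay
```

Adding the suggested 10,000 bytes of FORTRAN overhead to the total gives a rough estimate of the region needed for execution.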

JCL for DEC-TEST at SUSB 

Figure A-7 provides a listing of the JCL (Job 
Control Language) statements necessary to compile, link- 
edit, and execute DEC-TEST at the SUSB (State University 
of New York at Stony Brook) Computing Center. At SUSB 
logical unit numbers 5, 6, and 7 are defined in the 
catalogued procedure for FORTGCLG as logical units for 
reading punched cards, printing, and punching, 
respectively. Any other required logical unit must be defined 
by the user with a //GO.FT ... statement (see Fortran 
Programmer's Guide or JCL Manual). 

If (in addition to compiling, linkediting, and 
executing DEC-TEST) one wanted to store a DEC-TEST load 
module (say D5950) in a catalogued dataset (say TESTAID) 
on a disk (say USER01), then the following statement 
would be placed immediately before the //LKED.SYSIN DD * 
card in Figure A-7: 

//LKED.SYSLMOD DD DSN=USER.TESTAID(D5950),DISP=(OLD,KEEP), 
// SPACE=(TRK,(5,5,2),RLSE),VOL=SER=USER01,UNIT=3330 
FIGURE A-6 

Example of 
Worksheet for User-Defined Matrices, 
Vectors, and Assignment Statements 

NDIM = 59    KDIM = 50    IADIM = 200    LUCC = 5 

No. of bytes for user-defined matrices and vectors 
     = NDIM[(4)(IADIM) + 212] + (164)(KDIM) + 182 
     = 59[(4)(200) + 212] + (164)(50) + 182 
     = 68090 bytesᵃ 

Variable           User               No. of         No. of Bytes
Dimensions         Dimensions         Locations

X(NDIM,IADIM)      X(59,200)           11800
Z(NDIM+2,20)       Z(61,20)             1220
R(NDIM,20)         R(59,20)             1180
RIT(KDIM+1)        RIT(51)                51
T(KDIM,32)         T(50,32)             1600
RS(KDIM)           RS(50)                 50
RT(KDIM)           RT(50)                 50
JT(KDIM)           JT(50)                 50
                   Subtotal-1          16001 x 4 =   64004

Y(NDIM,24)         Y(59,24)             1416
IYSRT(NDIM)        IYSRT(59)              59
IT(KDIM+1,9)       IT(51,9)              459
MS(KDIM)           MS(50)                 50
IDDM(NDIM)         IDDM(59)               59
                   Subtotal-2           2043 x 2 =    4086

No. of bytes for user-defined matrices and 
vectors (Subtotal-1 + Subtotal-2)                    68090 bytesᵃ 

Number of bytes required 
by DEC-TEST using overlay (70094 bytes) 
or not using overlay (124294 bytes)                  70094 bytes 

Total                                               138184 bytes 

ᵃThese two values should be identical. 
FIGURE A-7 

JCL to Compile, Linkedit, and Execute DEC-TEST with Overlay 

// (job card) 
// (account card) 
// EXEC FORTGCLG,PARM.LKED='OVLY' 
//FORT.SYSIN DD * 
      (source deck) 
/* 
//LKED.SYSIN DD * 
 ENTRY MAIN 
 INSERT MAIN,INPUT,PAGER,STDV,CORR,SEM 
 OVERLAY ALPHA 
 INSERT COVER 
 OVERLAY ALPHA 
 INSERT IOSDI 
 OVERLAY ALPHA 
 INSERT SETUP 
 OVERLAY ALPHA 
 INSERT SCORE 
 OVERLAY ALPHA 
 INSERT UML 
 OVERLAY ALPHA 
 INSERT IADCT 
 OVERLAY ALPHA 
 INSERT IAELIM 
 OVERLAY ALPHA 
 INSERT IACLAS 
 OVERLAY ALPHA 
 INSERT SUMRY 
 OVERLAY ALPHA 
 INSERT SXITEM 
 OVERLAY ALPHA 
 INSERT RELIAB 
//GO.FT__F001 DD SYSOUT=... 
//GO.FT__F001 DD SYSOUT=... 
//GO.FT__F001 DD * 
      (DEC-TEST control cards and student data) 
/* 
// 

Overlay: The INSERT and OVERLAY statements, in 
conjunction with PARM.LKED='OVLY' on the EXEC card, 
accomplish the overlay. 

FT cards: Positions underlined (shown here as __) should 
be filled in with logical unit numbers required for user's 
run of DEC-TEST. 
Once this is done, the user can execute DEC-TEST using 
the following JCL statements: 

// (job card) 
// (account card) 
// EXEC PGM=D5950 
//STEPLIB DD DSN=USER.TESTAID,DISP=OLD 
//FT__F001 DD SYSOUT=... 
//FT__F001 DD SYSOUT=... 
//FT__F001 DD * 
      (DEC-TEST control cards and student data) 
/* 
// 
APPENDIX B 
Test Items Used for This Study 

The following is a list of test items used in this study. 
An asterisk (*) beside an alternative indicates the correct answer. 

The following identification scheme for items is employed: 

(a) Items identified as AC01 to AC25 are criterion- 
    referenced items for test A, which was used as a 
    pretest for some subjects, as a posttest for other 
    subjects, and as both a pre- and posttest for still 
    other subjects. 

(b) Items identified as BC01 to BC25 are criterion- 
    referenced items for test B, which was used as a 
    pretest for some subjects, as a posttest for other 
    subjects, and as both a pre- and posttest for still 
    other subjects. Note that ACyy is intended to be 
    equivalent to BCyy, where "yy" is any item number 
    (01 to 25). 

(c) Items identified as ZC26 to ZC50 are criterion- 
    referenced items which all subjects took in the 
    posttest mode, only. None of the ZC items are 
    intended to be equivalent to any AC or BC item. 

AC Items 

AC01 Objectives have not been defined for which of the following 
     domains? 
       A. affective 
       B. cognitive 
     * C. objective 
       D. psychomotor 

AC02 Which of the following terms is least acceptable for 
     instructional objectives which are to be measured through 
     multiple-choice test items? 
       A. recognize 
       B. differentiate 
       C. identify 
     * D. list 

AC03 Which of the following is most correct? Instructional 
     objectives should: 
       A. be stated in terms of teacher behavior 
       B. end with an active verb 
       C. relate to one or two processes only 
     * D. represent intended direct outcomes of learning 
          experiences 
AC04 Which of the following is not correct? "Standardized" 
     tests provide a standard for: 
     * A. excellence 
       B. timing 
       C. scoring 
       D. administration 

AC05 The word "criterion" in "criterion-referenced test" 
     usually refers to: 
       A. a cut-off value, such as 85% correct 
     * B. a set of objectives 
       C. some type of norms 
       D. another test 

AC06 The assignment of numerals to objects or events according 
     to rules is a definition of: 
     * A. measurement 
       B. evaluation 
       C. testing 
       D. validation 

AC07 In educational measurement, the underlying scale of 
     measurement is usually: 
       A. nominal 
     * B. ordinal 
       C. interval 
       D. ratio 

AC08 The percentage of students who get an item correct is 
     called: 
     * A. difficulty level 
       B. error rate 
       C. theoretical difficulty level 
       D. theoretical error rate 

AC09 A statistic used to show how sharply an item differentiates 
     between the students who scored highest on a test and the 
     students who scored lowest is called a (an): 
       A. difficulty level 
       B. error rate 
     * C. discrimination index 
       D. cut-off value 

AC10 Which of the following is not a possible value of the 
     standard deviation of a set of scores? 
       A. 0.00 
       B. 100.00 
       C. 0.01 
     * D. -1.00 
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' ACll The 50th percentile is also called the: 

A, standard deviation 

B, mean 

C, semi- interquartile range 
* median 



AC12 Which of the following is not a possible value for the 
Pearson product-iAoment correlation coefficient? 

A. 0.00 
B- 0.50 
C. -1.00 

* D. 1.25 

AC13 If test scores are distributed normally, what percent 
of the scores will exceed a score falling one standard 
ceviation below the mean? 
A- 16% 
B- 34% 
C. 68% 

* D- 84% 

AC14 Which of the fallowing estimates of reliability is most 
clearly assiciated with test homogeneity? ' 

* A. Kuder-Richardson 

B. Test-Retest 

C. Equi valent-Tes ts 

D. Split-halves 

AC15 Which of the following estimates of reliability is most 
clearly associated with the Spearman-Brown Prophecy 
Formula? 

A. Kuder-Richardson 

B. Test-Retest 

C. Equivalent-Tests 

* D. Split-halves 

AC16 The average score that a person would make over repeated 
trials on the same test is bis: 
A. reliability 
*-B. true score 

C. obtained score 

D. error variance 

AC17 The standard deviation of the distribution of error 
. scores is called the: 

A. reliability of the test 

B. reliabili ty error 

C. standard error of estimate 

* D. standard error of measurement 

AC18 The extent to which a test truly represents the area of 
knowledge under consideration is its : 
A. face validity 

* B. content validity 

C. criterion validity 

D. construct validity

AC19 The type of validity that is particularly relevant when
evaluating personality measures is:
A. content validity

* B. construct validity 

C. criterion-related validity 

D. face validity 

AC20 If raw scores on a test are normally distributed, the 

greatest difference in raw score points will be between 
percentile ranks: 

* A. 1 and 5 
B. 25 and 30

C. 50 and 55 

D. 90 and 95 
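
The keyed answer can be checked numerically. The sketch below, which assumes a unit normal distribution and uses the standard library's `NormalDist`, converts each pair of percentile ranks offered above into a distance in standard deviation units. The helper name `z_gap` is hypothetical.

```python
from statistics import NormalDist


def z_gap(lower_pct: float, upper_pct: float) -> float:
    """Distance, in standard deviation units, between two percentile ranks."""
    unit_normal = NormalDist()  # mean 0, standard deviation 1
    return unit_normal.inv_cdf(upper_pct / 100) - unit_normal.inv_cdf(lower_pct / 100)


# The four pairs of percentile ranks offered as options in the item above.
gaps = {pair: z_gap(*pair) for pair in [(1, 5), (25, 30), (50, 55), (90, 95)]}
for pair, gap in gaps.items():
    print(pair, round(gap, 2))
```

Because scores are sparse in the tails of a normal distribution, equal differences in percentile rank span more raw score points there than near the median.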

AC21 Ernest scored at the 99th percentile on the entrance test 
at Cascade College. The best interpretation of his score 
is:

A. He should obtain higher grades than 99 percent of
the students. 

B. He should obtain high grades with relatively little 
effort. 

C. His score is comparable to an IQ of about 130. 

* D. He scored higher than 99 percent of the students 

taking the test. 

AC22 Which of the following scores is expressed in raw score
units?

A. stanines

* B. percentile points 

C. normalized standard scores 

D. percentile ranks 

AC23 The mean and standard deviation of the distribution of
z scores are, respectively: 

* A. 0 and 1

B. 10 and 3

C. 50 and 10
D. 100 and 15

AC24 The first question to be asked when evaluating a
standardized achievement test is:

A. What is the editorial quality of the test? 

* B. What does the test measure? 

C. How reliable is the test? 

D. Are equivalent forms of the test available?

AC25 If a test is valid, then the test: 

A. scores must be normally distributed 

* B. must be reliable 

C. must be relatively long 

D. must have national norms






BC01 Objectives have not been defined for which of the following
domains?

* A. psychological

B. cognitive 

C. psychomotor 

D. none of the above 

BC02 Which of the following terms is least acceptable for 

instructional objectives which are to be measured through 
multiple-choice test items? 
A. recognize 

* B. recall 

C. identify 

D. differentiate 

BC03 Which of the following is most correct? Instructional
objectives should:

A. end with an active verb 

B. be stated in terms of teacher behavior 

C. start with an active verb 

D. be stated in terms of student behavior

BC04 Which of the following is not correct? Standardized
tests provide a standard for:

A. timing 

B. scoring 

* C. precision 

D. administration 

BC05 The word 'criterion' in 'criterion-referenced test'
usually refers to:

* A. a set of objectives

B. criterion validity

C. students' scores on a previous test

D. a standardized achievement test 

BC06 The assignment of numerals to objects or events according 
to rules is a definition of: 
A. evaluation 
B. testing

* C. measurement 
D. statistics 

BC07 In educational measurement, we usually assume that the
scale of measurement is:

A. nominal 

B. ordinal 

* C. interval 
D. ratio 



BC08 The number of students who get an item correct divided 
by the total number of students is called: 
* A. difficulty level 

B. error rate 

C. theoretical difficulty level 

D. theoretical error rate 
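
The difficulty level defined in this item is a simple proportion. A minimal sketch follows, assuming items are scored in the classical correct/wrong manner (1 or 0). The function name and the ten responses are hypothetical.

```python
def difficulty_level(item_scores: list[int]) -> float:
    """Proportion of students answering the item correctly (1 = correct, 0 = wrong)."""
    return sum(item_scores) / len(item_scores)


# Hypothetical responses of ten students to a single item.
responses = [1, 1, 0, 1, 0, 1, 1, 1, 0, 1]
print(difficulty_level(responses))  # 0.7
```

Multiplying the result by 100 gives the percentage form used in the parallel item AC08.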




BC09 A statistic used to show how sharply an item differentiates 
between students who score high on a test and students 
who score low is called a:
* A. discrimination index

B. standard deviation

C. correlation coefficient
D. standard error of measurement



BC10 Which of the following is not a possible value of the
standard deviation of a set of scores?
A. 10.01
B. 0.00

* C. -0.01
D. 201.00 

BC11 The median is also called the:
A. variance 

* B. 50th percentile 

C. semi-interquartile range 

D. mode 



BC12 Which of the following is not a possible value of the 
Pearson product-moment correlation coefficient?

A. -0.0001 

B. -1.0000 

C. 0.9999 
* D. 1.0001 



BC13 If test scores are distributed normally, what percent 
of the scores will exceed a score falling one standard 
deviation above the mean? 
* A. 16% 

B. 34%

C. 68% 
D. 



BC14 Which of the following estimates of reliability is most
clearly associated with internal consistency?
A. split-halves
B. equivalent-tests

* C. Kuder-Richardson
D. test-retest

BC15 Which of the following estimates of reliability typically
employs the Spearman-Brown Prophecy Formula?

A. Kuder-Richardson

B. parallel-tests

* C. Split-halves

D. coefficient alpha

BC16 The average score that a person would receive over
repeated administrations of the same test is his:

A. theoretical score

B. observed score

* C. true score
D. error score

BC17 The standard deviation of the distribution of error
scores is called the:

A. standard deviation 

B. reliability of the test 

C. error deviation 

* D. standard error of measurement 
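
Under classical test theory the standard error of measurement is usually estimated from the test's standard deviation and reliability as SEM = SD * sqrt(1 - r). The sketch below applies that formula. The function name and the sample values (standard deviation 10, reliability .91) are assumptions for illustration only.

```python
import math


def standard_error_of_measurement(sd: float, reliability: float) -> float:
    """Classical estimate: SEM = SD * sqrt(1 - reliability)."""
    return sd * math.sqrt(1.0 - reliability)


# Hypothetical test with standard deviation 10 and reliability .91.
sem = standard_error_of_measurement(10.0, 0.91)
print(round(sem, 1))  # 3.0
```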

BC18 The type of validity most appropriate for achievement 
tests is:

A. face validity

* B. content validity

C. construct validity 

D. criterion validity 

BC19 The type of validity most clearly associated with
theories of personality is:

A. face validity 

B. content validity 

C. criterion validity 

* D. construct validity 

BC20 If raw scores are normally distributed, the greatest 

difference in raw score points will be between percentile 
ranks:

* A. 1 and 5

B. 48 and 52

C. 70 and 74

D. 94 and 98

BC21 Jerry scored at the 75th percentile on the SAT-Mathematics 
test. The best interpretation of his score is:

* A. he scored higher than 75 percent of the students 

who took the test. 

B. he scored higher than 25 percent of the students
who took the test. 

C. his IQ is above average 

D. his IQ is below average 

BC22 Which of the following scores is expressed in raw 
score units?

A. T-scores 

B. Z-scores 

* C. percentile points 
D. percentile ranks 

BC23 The mean and standard deviation of z-scores are respectively:

A. 5 and 2 

B. 50 and 10 

C. 100 and 16 

* D. none of the above 
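
None of the listed pairs is correct because z-scores always have a mean of 0 and a standard deviation of 1. The sketch below standardizes a small set of raw scores to show this. The five raw scores and the function name are hypothetical, and the population standard deviation (`pstdev`) is assumed as the divisor.

```python
from statistics import mean, pstdev


def z_scores(raw: list[float]) -> list[float]:
    """Convert raw scores to z-scores using the group's own mean and standard deviation."""
    m, s = mean(raw), pstdev(raw)
    return [(x - m) / s for x in raw]


# Hypothetical raw scores for five students.
raw_scores = [42, 45, 50, 53, 60]
z = z_scores(raw_scores)
print(round(mean(z), 6), round(pstdev(z), 6))
```

The printed mean is zero and the printed standard deviation is one, apart from floating-point rounding.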

BC24 The most important aspect of a standardized achievement
test is its:

* A. content 

B. reliability 

C. editorial quality 

D. cost

BC25 If a test is valid, then the test must be:

A. relatively long 

* B. reliable 

C. standardized
D. normally distributed 

ZC26 Which is the best example of a free-response test? 

* A. an essay test 

B. a matching test

C. a multiple-choice test 

D. a short-answer test 



ZC27 In a pure power test, test takers would not differ in the: 

* A. number of items attempted

B. number of items answered correctly 

C. percent of items answered correctly 

D. time taken to complete the test 

ZC28 Which of the following types of tests is least apt 
to be used in order to rank order students? 

A. norm-referenced test

B. standardized test 

* C. criterion-referenced test 
D. non-standardized test 



ZC29 In educational measurement, we usually tacitly assume 
that the underlying scale of measurement is: 

A. nominal 

B. ordinal 
* C. interval 

D. ratio



ZC30 The percentage of students that we expect will get an 
item wrong if everybody guesses blindly is called: 

A. difficulty level 

B. error rate 

C. theoretical difficulty level 
* D. theoretical error rate 

ZC31 Which of the following is a positively skewed distribution?




ZC32 In which of the following distributions do the mean, 
median, and mode always coincide?
* A. Normal

B. Positively skewed
C. Bimodal
D. J-shaped

ZC33 Which of the following is not an index of the dispersion
of a set of test scores? 

* A. the mean 

B. the standard deviation 

C. the variance 

D. the range 

ZC34 An index measuring the degree of relationship between 

two different measures for a group of individuals is a:

* A. correlation coefficient 

B. standard error of measurement 

C. discrimination index 

D. standard deviation 



ZC35 If high values of X are associated with low values of Y, 
and low values of X are associated with high values of Y, 
then:

A. r_xy = 0

B. r_xy > 0

* C. r_xy < 0

D. r_xy is undefined

ZC36 Which of the following statements is true? 

A. Different methods of computing a reliability co- 
efficient will give the same result. 

B. Very hard tests generally have higher reliabilities 
than very easy tests. 

* C. A longer test is generally more reliable than a 

shorter one. 

D. Older tests, which have been used more, are generally 
more reliable than newer ones. 

ZC37 Which of the following estimates of reliability takes 
into account the most sources of variation? 
A. Test-Retest without time interval intervening.

B. Test-Retest with time interval intervening. 

C. Equivalent-Tests without time interval intervening. 

* D. Equivalent-Tests with time interval intervening. 

ZC38 The reliability of a test refers to: 

A. how accurately the test measures the trait it is
designed to measure.
* B. the precision with which the test measures whatever
it measures.

C. how accurately the test categorizes people into 
defined groups. 

D. how much faith you can put in the test scores.

ZC39 Mary has taken an intelligence test during each of the
last three years. Her scores on successive testings
were 121, 118, and 114. The most reasonable explanation
of these results is:

A. Her scores are dropping as competition gets rougher 
at older age levels. 

B. A personal problem is probably interfering with her 
performance. 

C. The test used is not good as it does not measure 
consistently.

* D. The scores are within the range expected on repeated 

testings. 

ZC40 Which of the following is most useful for estimating a
person's true score?
A. the reliability of the test

* B. the standard error of measurement

C. the mean of the test

D. the standard deviation of the test

ZC41 Scores on a final exam in introductory psychology are
correlated with scores on a well-known norm-referenced
test in psychology. The resulting correlation coefficient
is evidence of the final exam's:
A. face validity

B. content validity

* C. criterion validity 
D. construct validity 

ZC42 Which of the following is not an essential requirement
of a criterion measure?

A. measures an important component of the task

B. measures reliably

* C. measures more than one behavior 
D. is free from bias 

ZC43 The weakest link in most validity studies is the:
A. predictors 

* B. criterion 

C. sample 

D. validation technique 

ZC44 The correlations between predictors and performance in
an auto mechanics lab are: compulsivity, +.26; mechanical
comprehension, +.33; intelligence, +.05; English grades,
-.43. The best predictor of performance in the lab is:

A. compulsivity scores

B. mechanical comprehension 

C. intelligence 

* D. English grades

ZC45 If, when predicting college grades from a college
admissions test, r_xy = .50, we can say that:

* A. 25 percent of the variance in grades is predictable 

from the test scores. 

B. 50 percent of the variance in grades is predictable
from the test scores. 

C. Using the test will reduce prediction errors by 
50 percent. 

D. Predicting from the test will be 25 percent better 
than chance predictions. 

ZC46 A test has a mean of 45 and a standard deviation of 10
points. If the distribution of scores is approximately 
normal, the range of scores in a class of 50 students 
would be from approximately: 

A. 40 to 50

B. 35 to 55 

C. 25 to 65 

* D. 15 to 75 
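
The keyed range is the mean plus and minus three standard deviations. As a rough check, the sketch below computes the proportion of a normal distribution that falls inside those limits, using the mean and standard deviation given in the item. The variable names are hypothetical, and the calculation illustrates the three-standard-deviation rule of thumb rather than an exact expected range for 50 students.

```python
from statistics import NormalDist

# Mean and standard deviation taken from the item above.
scores = NormalDist(mu=45, sigma=10)

low = scores.mean - 3 * scores.stdev
high = scores.mean + 3 * scores.stdev

# Proportion of a normal distribution lying within three standard deviations.
inside = scores.cdf(high) - scores.cdf(low)
print(low, high, round(inside, 4))  # 15.0 75.0 0.9973

# Expected number of students, out of 50, falling outside the limits.
print(round(50 * (1 - inside), 2))  # 0.13
```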

ZC47 Norms are most useful for:

A. selecting the best qualified workers

* B. comparing a person to his immediate competitors
C. studying the extent of individual differences
D. computing the validity of a test

ZC48 Which of the following are least useful for evaluating
college students?

* A. age norms

B. grade norms 

C. percentile norms 

D. standard scores 

ZC49 Consistency of measurement is often called:

* A. reliability 

B. validity 

C. variance 

D. skewness 

ZC50 In preparing students to take standardized achievement 
batteries, the teacher should: 

A. drill the students on the material to be covered on 
the test. 

* B. briefly explain the nature and purpose of the test. 

C. keep reminding the students how important the test
will be. 

D. say nothing in advance of the test so students will 
not become anxious.